# Iterative algorithms for MDPs, demonstrated via a grid world toy model

The agent moves within a maze (grid world) and tries to reach an exit field with a positive reward. Movement is probabilistic: if the agent chooses action ↑, it actually moves up with 80% probability, and with 10% probability each it ends up in the field to its left or to its right instead. The rules for actions ↓, ←, → are analogous. If a movement is impossible (the target field is inaccessible), the agent simply remains at its current location.

(Adapted from the textbook "Artificial Intelligence: A Modern Approach".)
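
The movement rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code of `MazeEnv`: the names `MAZE`, `move`, `outcome_distribution` and `next_state_distribution` are hypothetical, positions are `(row, column)` pairs with row 0 at the top, and the maze layout is the one used later in this notebook.

```python
# Hypothetical sketch of the noisy movement rule (not taken from env.py).
MAZE = ["...E", ".X.F", "S..."]          # X marks an inaccessible field

# Directions as (row, column) offsets; rows grow downward.
MOVES = {"↑": (-1, 0), "↓": (1, 0), "←": (0, -1), "→": (0, 1)}
# Direction obtained by turning 90 degrees to the left of each heading.
LEFT_OF = {"↑": "←", "←": "↓", "↓": "→", "→": "↑"}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}


def outcome_distribution(action):
    """Probability of each direction actually taken for a chosen action."""
    return {action: 0.8, LEFT_OF[action]: 0.1, RIGHT_OF[action]: 0.1}


def move(pos, direction):
    """Deterministic move; the agent stays put if the target is inaccessible."""
    row, col = pos
    d_row, d_col = MOVES[direction]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < len(MAZE) and 0 <= new_col < len(MAZE[0])
    if not inside or MAZE[new_row][new_col] == "X":
        return pos
    return (new_row, new_col)


def next_state_distribution(pos, action):
    """Distribution over successor positions for a chosen action."""
    dist = {}
    for direction, prob in outcome_distribution(action).items():
        target = move(pos, direction)
        dist[target] = dist.get(target, 0.0) + prob
    return dist


# From the start field (bottom left), choosing ↑:
print(next_state_distribution((2, 0), "↑"))
```

Slipping into the outer boundary or into the blocked field both leave the agent where it is, so the probability of staying put accumulates over all blocked directions.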

In [1]:
import numpy as np

In [2]:
# iterative algorithms and utility functions for Markov Decision Processes (MDPs)
import mdp

In [3]:
# environment
from env import MazeEnv

In [4]:
# show description of maze (grid world) geometry
with open('maze_geometry.txt', 'r') as file:
    for line in file:
        print(line, end='')

# S: start, X: inaccessible, E: exit with reward +1, F: exit with reward -1
...E
.X.F
S...


In [5]:
# reward received on every regular (non-exit) field; negative, so it acts as a step cost
r = -0.04

In [6]:
# define environment
e = MazeEnv('maze_geometry.txt', r)

In [7]:
# discount factor
gamma = 0.99

In [8]:
# perform value iteration
u = mdp.value_iteration(e.tprob, e.rewards, gamma)
print('value function (r = {}, gamma = {}):'.format(r, gamma))
# np.printoptions requires NumPy >= 1.15; fall back to default formatting otherwise
if hasattr(np, 'printoptions'):
    with np.printoptions(precision=3):
        print(e.maze_array(u))
else:
    print(e.maze_array(u))

value iteration with epsilon=1e-14 completed after 56 iterations
value function (r = -0.04, gamma = 0.99):
[[ 0.776  0.844  0.905  1.   ]
 [ 0.717    nan  0.641 -1.   ]
 [ 0.651  0.593  0.56   0.338]]
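
For readers without access to the `mdp` module, here is a minimal sketch of value iteration. It is not the module's implementation: the array layout `P[a, s, t]` (probability of moving from state `s` to state `t` under action `a`), the stopping rule, and the three-state toy problem are all assumptions made for illustration. The update uses the textbook convention with state-based rewards, U(s) ← R(s) + γ · max_a Σ_t P(t | s, a) · U(t).

```python
import numpy as np


def value_iteration_sketch(P, R, gamma, eps=1e-12, max_iter=10_000):
    """Iterate the Bellman update until the largest change drops below eps."""
    u = np.zeros(len(R))
    for _ in range(max_iter):
        # (P @ u)[a, s] is the expected utility of the successor of s under a
        u_new = R + gamma * (P @ u).max(axis=0)
        if np.max(np.abs(u_new - u)) < eps:
            return u_new
        u = u_new
    raise RuntimeError("value iteration did not converge")


# Hypothetical toy problem (not the maze): state 0 is a regular field,
# state 1 an exit with reward +1, state 2 an absorbing "game over" state.
P = np.zeros((2, 3, 3))
P[0, 0] = [0.2, 0.8, 0.0]   # action 0 ("go"): reaches the exit with 80%
P[1, 0] = [1.0, 0.0, 0.0]   # action 1 ("stay")
P[:, 1, 2] = 1.0            # any action on the exit ends the game
P[:, 2, 2] = 1.0            # the absorbing state is never left
R = np.array([-0.04, 1.0, 0.0])

u_toy = value_iteration_sketch(P, R, gamma=0.99)
print(u_toy)
```

In the toy problem the fixed point can be checked by hand: U(0) solves U = -0.04 + 0.99 · (0.8 + 0.2 · U), which gives U = 0.752 / 0.802.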


In [9]:
# optimal policy corresponding to u
pol = mdp.policy_from_utility(e.tprob, u)
print('optimal policy (r = {}, gamma = {}):'.format(r, gamma))
print(e.draw_policy(pol))

optimal policy (r = -0.04, gamma = 0.99):
→ → → E
↑ █ ↑ F
↑ ← ↑ ←
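
Extracting a policy from a utility function is a one-step lookahead: in every state, pick the action with the highest expected successor utility. The sketch below assumes the hypothetical layout `P[a, s, t]` and a toy problem that is not the maze; `greedy_policy` is an illustrative name, not the `mdp` function. With rewards that depend only on the current state, neither the rewards nor a positive discount factor change the argmax, which is consistent with the call above passing only the transition probabilities and `u`.

```python
import numpy as np


def greedy_policy(P, u):
    """For each state, the index of the action maximizing expected utility."""
    # (P @ u)[a, s] = sum over t of P[a, s, t] * u[t]
    return (P @ u).argmax(axis=0)


# Hypothetical toy problem: state 0 regular, state 1 exit, state 2 absorbing.
P = np.zeros((2, 3, 3))
P[0, 0] = [0.2, 0.8, 0.0]   # action 0 ("go")
P[1, 0] = [1.0, 0.0, 0.0]   # action 1 ("stay")
P[:, 1, 2] = 1.0
P[:, 2, 2] = 1.0

# Utilities of the toy problem for reward -0.04 and gamma = 0.99.
u_toy = np.array([0.752 / 0.802, 1.0, 0.0])

pol_toy = greedy_policy(P, u_toy)
print(pol_toy)
```

On states where all actions have the same expected utility (here the exit and the absorbing state), `argmax` simply returns the first action, so the policy entry there carries no information.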


In [10]:
# consistency check: evaluating the optimal policy should reproduce u
if gamma < 1:
    upol = mdp.utility_from_policy(e.tprob, e.rewards, gamma, pol)
    uerr = np.linalg.norm(upol - u)
    print('utility from policy consistency check error:', uerr)

utility from policy consistency check error: 3.554447978966673e-16
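
For a fixed policy the Bellman equation is linear, U = R + γ · P_π · U, so the utility can be obtained by solving (I − γ · P_π) · U = R. The sketch below shows this under the same assumptions as before (hypothetical layout `P[a, s, t]`, toy problem instead of the maze); whether `mdp.utility_from_policy` solves a linear system or iterates is not visible from this notebook.

```python
import numpy as np


def evaluate_policy(P, R, gamma, pol):
    """Solve the linear Bellman equation for a fixed policy."""
    n = len(R)
    # Row s of P_pol is the transition distribution P[pol[s], s, :]
    P_pol = P[pol, np.arange(n)]
    return np.linalg.solve(np.eye(n) - gamma * P_pol, R)


# Hypothetical toy problem: state 0 regular, state 1 exit, state 2 absorbing.
P = np.zeros((2, 3, 3))
P[0, 0] = [0.2, 0.8, 0.0]   # action 0 ("go")
P[1, 0] = [1.0, 0.0, 0.0]   # action 1 ("stay")
P[:, 1, 2] = 1.0
P[:, 2, 2] = 1.0
R = np.array([-0.04, 1.0, 0.0])

u_go = evaluate_policy(P, R, 0.99, np.array([0, 0, 0]))
u_stay = evaluate_policy(P, R, 0.99, np.array([1, 0, 0]))
print(u_go, u_stay)
```

With γ = 1 the row of I − γ · P_π belonging to the absorbing state vanishes and the system becomes singular, which may be why the check above is only run for `gamma < 1`.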


In [11]:
# alternative: policy iteration (should yield the same policy as value iteration)
if gamma < 1:
    pal = mdp.policy_iteration(e.tprob, e.rewards, gamma)
    palerr = np.linalg.norm((pal - pol) * e.pmask)
    print('policy iteration consistency check error:', palerr)

policy iteration completed after 3 iterations
policy iteration consistency check error: 0.0
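
Policy iteration alternates between evaluating the current policy and improving it greedily, and stops once the policy no longer changes. The following is a minimal sketch under the same assumptions as the earlier examples (hypothetical layout `P[a, s, t]`, toy problem instead of the maze); it is not the implementation in `mdp`.

```python
import numpy as np


def policy_iteration_sketch(P, R, gamma, max_iter=1000):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n = len(R)
    pol = np.ones(n, dtype=int)          # arbitrary initial policy
    for _ in range(max_iter):
        # Evaluation: solve (I - gamma * P_pol) u = R for the current policy
        P_pol = P[pol, np.arange(n)]
        u = np.linalg.solve(np.eye(n) - gamma * P_pol, R)
        # Improvement: one-step lookahead on the evaluated utilities
        new_pol = (P @ u).argmax(axis=0)
        if np.array_equal(new_pol, pol):
            return pol, u
        pol = new_pol
    raise RuntimeError("policy iteration did not converge")


# Hypothetical toy problem: state 0 regular, state 1 exit, state 2 absorbing.
P = np.zeros((2, 3, 3))
P[0, 0] = [0.2, 0.8, 0.0]   # action 0 ("go")
P[1, 0] = [1.0, 0.0, 0.0]   # action 1 ("stay")
P[:, 1, 2] = 1.0
P[:, 2, 2] = 1.0
R = np.array([-0.04, 1.0, 0.0])

pol_pi, u_pi = policy_iteration_sketch(P, R, gamma=0.99)
print(pol_pi, u_pi)
```

Two runs that break ties differently can disagree on states where every action is equally good. The comparison above multiplies by `e.pmask`, presumably to ignore such states.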


In [12]:
# Q-value function
Q = mdp.q_iteration(e.tprob, e.rewards, gamma)
print(Q)

Q-value iteration with epsilon=1e-14 completed after 52 iterations
[[ 0.56476064  0.65066309  0.61068739  0.59841561]
 [ 0.52092694  0.54926123  0.59267477  0.54926123]
 [ 0.34666916  0.5600724   0.54833699  0.49571846]
 [ 0.1621969  -0.74308651  0.33804366  0.31664407]
 [ 0.66883065  0.71663212  0.66883065  0.61721832]
 [-0.68694834  0.64132736  0.61298293  0.36806875]
 [-1.         -1.         -1.         -1.        ]
 [ 0.77618555  0.7351309   0.72252791  0.68796458]
 [ 0.84393511  0.79484347  0.74183811  0.79484347]
 [ 0.9050959   0.85938553  0.78149251  0.65048085]
 [ 1.          1.          1.          1.        ]
 [ 0.          0.          0.          0.        ]]
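
The printed array has one row per state and one column per action. A minimal sketch of an iteration that produces such a table is given below; it applies the update Q(s, a) ← R(s) + γ · Σ_t P(t | s, a) · max_b Q(t, b). The layout `P[a, s, t]`, the stopping rule and the toy problem are assumptions for illustration and not taken from `mdp.q_iteration`.

```python
import numpy as np


def q_iteration_sketch(P, R, gamma, eps=1e-12, max_iter=10_000):
    """Iterate the Bellman update for Q-values; result has shape (states, actions)."""
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        # Best achievable value in each successor state
        best_next = Q.max(axis=1)
        # (P @ best_next) has shape (actions, states); transpose to (states, actions)
        Q_new = R[:, None] + gamma * (P @ best_next).T
        if np.max(np.abs(Q_new - Q)) < eps:
            return Q_new
        Q = Q_new
    raise RuntimeError("Q-value iteration did not converge")


# Hypothetical toy problem: state 0 regular, state 1 exit, state 2 absorbing.
P = np.zeros((2, 3, 3))
P[0, 0] = [0.2, 0.8, 0.0]   # action 0 ("go")
P[1, 0] = [1.0, 0.0, 0.0]   # action 1 ("stay")
P[:, 1, 2] = 1.0
P[:, 2, 2] = 1.0
R = np.array([-0.04, 1.0, 0.0])

Q_toy = q_iteration_sketch(P, R, gamma=0.99)
print(Q_toy)
```

As in the maze output, rows of states in which the action does not matter (exit and absorbing state) contain the same value in every column.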


In [13]:
# the utility is the row-wise maximum of the Q-value function
uQ = mdp.utility_from_qvalue(Q)
uQerr = np.linalg.norm(u - uQ)
print('utility from Q-value consistency check error:', uQerr)

utility from Q-value consistency check error: 8.233634315563857e-16


In [14]:
# the policy is the row-wise argmax of the Q-value function
pQ = mdp.policy_from_qvalue(Q)
pQerr = np.linalg.norm((pQ - pol) * e.pmask)
print('policy from Q-value consistency check error:', pQerr)

policy from Q-value consistency check error: 0.0
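
Both conversions can be reproduced directly from the numbers printed above: the utility of a state is the maximum over its row of `Q`, and the policy is the position of that maximum. The state ordering (bottom row first, left to right, skipping the blocked field, with a final absorbing state) and the action ordering (→, ↑, ←, ↓) used below are inferred from comparing the printed outputs. They are assumptions, not documented properties of `mdp` or `MazeEnv`.

```python
import numpy as np

# Q-values copied from the output of the Q-value iteration above.
Q_printed = np.array([
    [ 0.56476064,  0.65066309,  0.61068739,  0.59841561],
    [ 0.52092694,  0.54926123,  0.59267477,  0.54926123],
    [ 0.34666916,  0.5600724 ,  0.54833699,  0.49571846],
    [ 0.1621969 , -0.74308651,  0.33804366,  0.31664407],
    [ 0.66883065,  0.71663212,  0.66883065,  0.61721832],
    [-0.68694834,  0.64132736,  0.61298293,  0.36806875],
    [-1.        , -1.        , -1.        , -1.        ],
    [ 0.77618555,  0.7351309 ,  0.72252791,  0.68796458],
    [ 0.84393511,  0.79484347,  0.74183811,  0.79484347],
    [ 0.9050959 ,  0.85938553,  0.78149251,  0.65048085],
    [ 1.        ,  1.        ,  1.        ,  1.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ],
])

utility = Q_printed.max(axis=1)      # best value per state
greedy = Q_printed.argmax(axis=1)    # best action index per state

# Assumed mapping from state index to (row, column), inferred from the outputs.
cells = [(2, 0), (2, 1), (2, 2), (2, 3), (1, 0), (1, 2), (1, 3),
         (0, 0), (0, 1), (0, 2), (0, 3)]
ARROWS = "→↑←↓"                      # assumed action ordering

grid_u = np.full((3, 4), np.nan)     # the blocked field stays nan
symbols = [["█"] * 4 for _ in range(3)]
for s, (row, col) in enumerate(cells):
    grid_u[row, col] = utility[s]
    symbols[row][col] = ARROWS[greedy[s]]
symbols[0][3], symbols[1][3] = "E", "F"   # exits: the action is irrelevant
policy_text = "\n".join(" ".join(row) for row in symbols)

print(np.round(grid_u, 3))
print(policy_text)
```

Under these assumptions the result agrees with the value function and the policy printed earlier in the notebook.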


In [15]:
# play a game
e.play(pol, gamma)

step 0
reward: -0.04
░ ░ ░ E
░ █ ░ F
☺ ░ ░ ░
action: ↑
_____________
step 1
reward: -0.04
░ ░ ░ E
☺ █ ░ F
░ ░ ░ ░
action: ↑
_____________
step 2
reward: -0.04
☺ ░ ░ E
░ █ ░ F
░ ░ ░ ░
action: →
_____________
step 3
reward: -0.04
░ ☺ ░ E
░ █ ░ F
░ ░ ░ ░
action: →
_____________
step 4
reward: -0.04
░ ☺ ░ E
░ █ ░ F
░ ░ ░ ░
action: →
_____________
step 5
reward: -0.04
░ ░ ☺ E
░ █ ░ F
░ ░ ░ ░
action: →
_____________
step 6
reward: 1.0
░ ░ ░ ☺
░ █ ░ F
░ ░ ░ ░
action: →
_____________
Game over!
cumulative discounted reward (gamma = 0.99): 0.707400747005
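
The reported return can be verified by hand from the rewards printed in the trace: the reward collected at step t is weighted by gamma to the power t. The short check below uses only the numbers shown above; the variable names are illustrative.

```python
gamma = 0.99
# Rewards from the trace: six steps on regular fields, then the exit.
rewards = [-0.04] * 6 + [1.0]

# Cumulative discounted reward: sum over t of gamma**t * reward_t
total = sum(gamma**t * r for t, r in enumerate(rewards))
print(total)
```

The value agrees with the printed 0.707400747005 up to rounding. Step 4 of the trace also shows the noisy movement at work: the agent chose → but stayed on the same field.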
