# Policy gradients for MDPs, demonstrated via a grid world toy model

The agent moves through a maze and tries to reach the exit field with positive reward without being captured by the ghost in the maze. The agent's movement rules are the same as in *plain_maze*; the ghost moves randomly by one field per step, with a preference for the direction towards the agent. The state space consists of all combinations of agent and ghost locations, plus a final "game over" state.

In [1]:
import numpy as np

In [2]:
# policy gradient iteration
from pg import policy_gradient_iteration
# corresponding network
from policy_net import PolicyNet
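
The `pg` module is not shown in this notebook. As a rough sketch of the kind of update a policy gradient method performs, the block below computes a REINFORCE-style gradient estimate from one episode. It assumes a tabular softmax policy in place of `PolicyNet`, and weights each step by its discounted return-to-go, which is one common variant. The names `softmax` and `reinforce_gradient` and the toy trajectory are hypothetical and not taken from `pg`.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    z = z - np.max(z)
    p = np.exp(z)
    return p / np.sum(p)

def reinforce_gradient(theta, states, actions, rewards, gamma):
    """
    REINFORCE gradient estimate from a single episode.

    theta:   (num_states, num_actions) logits of a tabular softmax policy
             (an illustrative stand-in for the policy network)
    states, actions, rewards: the episode trajectory, one entry per step
    """
    grad = np.zeros_like(theta)
    ret = 0.0
    # walk the episode backwards to accumulate the discounted return-to-go
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret
        probs = softmax(theta[states[t]])
        # gradient of log pi(a|s) w.r.t. the logits of state s: onehot(a) - probs
        dlogp = -probs
        dlogp[actions[t]] += 1.0
        grad[states[t]] += ret * dlogp
    return grad

# one gradient-ascent step on a made-up two-step trajectory
theta = np.zeros((3, 4))
grad = reinforce_gradient(theta, states=[0, 1], actions=[2, 0],
                          rewards=[-0.04, 1.0], gamma=0.99)
theta += 0.1 * grad  # the step size 0.1 is an arbitrary choice
```

Actions followed by a high return get their logits raised and the alternatives lowered. Repeated over many sampled episodes, this shifts probability mass towards rewarding behaviour.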

In [3]:
# environment
from env import MazeEnv

In [4]:
# iterative algorithms for reference calculation
import mdp

In [5]:
# show description of maze (grid world) geometry
with open('maze_geometry.txt', 'r') as file:
    for line in file:
        print(line, end='')

# S: start, X: inaccessible, E: exit with reward +1, F: exit with reward -1
...E
.X.F
S...


In [6]:
# "reward" on regular fields
r = -0.04

In [7]:
# define environment
e = MazeEnv('maze_geometry.txt', r)

In [8]:
# discount factor
gamma = 0.99
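
The "total discounted reward" values in the training log further down are consistent with summing `gamma**t * r_t` over the steps of an episode. A minimal sketch, assuming that convention (the helper `discounted_return` is hypothetical and not part of `pg` or `env`):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over the steps t of one episode."""
    total = 0.0
    # accumulate backwards: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# three regular steps (r = -0.04), then the exit with reward +1
episode_rewards = [-0.04, -0.04, -0.04, 1.0]
print(round(discounted_return(episode_rewards, 0.99), 6))  # 0.851495
```

This reproduces the value logged for the successful four-step episodes. A two-step episode ending at the +1 exit gives -0.04 + 0.99 = 0.95 in the same way.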

In [9]:
# policy neural network
net = PolicyNet(e.observation(0).size, e.num_actions)

In [10]:
print('starting policy gradient iteration...')
net = policy_gradient_iteration(net, e, gamma, nepisodes=50000)

starting policy gradient iteration...
episode 500 completed, nsteps: 1, total discounted reward: -1, running mean: -0.997288
episode 1000 completed, nsteps: 18, total discounted reward: -1.47117, running mean: -0.716186
episode 1500 completed, nsteps: 3, total discounted reward: -1.0597, running mean: -0.544335
episode 2000 completed, nsteps: 4, total discounted reward: 0.851495, running mean: -0.364709
episode 2500 completed, nsteps: 12, total discounted reward: -1.31399, running mean: -0.193828
episode 3000 completed, nsteps: 4, total discounted reward: 0.851495, running mean: -0.0585713
episode 3500 completed, nsteps: 8, total discounted reward: 0.660327, running mean: 0.0569061
episode 4000 completed, nsteps: 1, total discounted reward: -1, running mean: 0.123769
episode 4500 completed, nsteps: 10, total discounted reward: -1.25945, running mean: 0.1676
frames of episode 5000:
step 0
reward: -0.04
░ ░ ░ E
░ █ ░ F
░ ☺ ░ ░
action: →
_____________
step 1
reward: -0.04
░ ░ ░ E
░ █ ░ F


episode 30500 completed, nsteps: 4, total discounted reward: 0.851495, running mean: 0.504695
episode 31000 completed, nsteps: 8, total discounted reward: 0.660327, running mean: 0.505097
episode 31500 completed, nsteps: 5, total discounted reward: 0.80298, running mean: 0.514809
episode 32000 completed, nsteps: 1, total discounted reward: 1, running mean: 0.513702
episode 32500 completed, nsteps: 2, total discounted reward: 0.95, running mean: 0.525957
episode 33000 completed, nsteps: 3, total discounted reward: 0.9005, running mean: 0.532842
episode 33500 completed, nsteps: 4, total discounted reward: 0.851495, running mean: 0.517434
episode 34000 completed, nsteps: 1, total discounted reward: 1, running mean: 0.521434
episode 34500 completed, nsteps: 2, total discounted reward: 0.95, running mean: 0.536683
frames of episode 35000:
step 0
reward: -0.04
░ ░ ░ E
☺ █ ░ F
░ ░ ░ ░
action: ↑
_____________
step 1
reward: -0.04
☺ ░ ░ E
░ █ ░ F
░ ░ ░ ░
action: →
_____________
step 2
reward: -

In [11]:
# obtain policy from network: most likely action for each state
pol = np.zeros(e.num_states, dtype=int)
for s in range(e.num_states):
    x = e.observation(s).reshape(-1)
    aprob = net.evaluate(x[None, :])[0]
    pol[s] = np.argmax(aprob)

In [12]:
# reference optimal policy
pref = mdp.policy_iteration(e.tprob, e.rewards, gamma)
# corresponding value function
uref = mdp.utility_from_policy(e.tprob, e.rewards, gamma, pref)
# omit "game over" from average
umean = np.mean(uref[:-1])
print('optimal value function average:', umean)

policy iteration completed after 3 iterations
optimal value function average: 0.5476936325612626


In [13]:
# compare policy with reference
print('policy (most likely action):')
print(e.draw_policy(pol))
print('policy reference solution:')
print(e.draw_policy(pref))
print('number of deviations from reference:', np.sum(((pol - pref) * e.pmask) != 0))

policy (most likely action):
→ → → E
↑ █ ↑ F
↑ → ↑ ←
policy reference solution:
→ → → E
↑ █ ↑ F
↑ ← ↑ ←
number of deviations from reference: 1
