# Policy gradients for MDPs, demonstrated via a simple Pac-Man game

The agent moves within a maze and tries to reach an exit field with a positive reward without being captured by the ghost in the maze. The agent moves according to the same rules as in *plain_maze*, while the ghost moves randomly by one field per step, with a preference for the direction towards the agent. The state space consists of all possible combinations of agent and ghost locations, plus a final "game over" state.
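
To give a feeling for the size of such a state space, the sketch below enumerates agent/ghost location pairs for the maze layout printed further down and appends one terminal state. The helper `enumerate_states` and the ordering of the states are assumptions made for illustration; the actual `MazeGhostEnv` is not shown here and may index or restrict its states differently.

```python
# Illustrative sketch only: the real state indexing of MazeGhostEnv is not
# shown in this notebook and may differ from this enumeration.
MAZE = ["..XE",
        ".X..",
        ".X.F",
        "S..."]

def enumerate_states(maze):
    """List all (agent, ghost) cell pairs plus a final 'game over' marker."""
    # every field except the inaccessible 'X' fields can hold agent or ghost
    cells = [(i, j) for i, row in enumerate(maze)
             for j, c in enumerate(row) if c != 'X']
    # pairs with agent == ghost are included here (they correspond to capture)
    states = [(a, g) for a in cells for g in cells]
    states.append('game over')
    return cells, states

cells, states = enumerate_states(MAZE)
print(len(cells), len(states))  # 13 170
```

Under this assumed enumeration there are 13 accessible fields and 13 * 13 + 1 = 170 states.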

In [1]:
import numpy as np

In [2]:
# policy gradient iteration
from pg import policy_gradient_iteration
# corresponding network
from policy_net import PolicyNet
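
The module `pg` is not shown in this notebook. As rough orientation, a REINFORCE-style policy gradient step for a linear softmax policy might look like the sketch below. The names `softmax`, `discounted_returns` and `reinforce_update`, the learning rate, and the linear parameterization are all assumptions; they are not the contents of `policy_gradient_iteration` or `PolicyNet`. The additional `gamma**t` weighting of each term is omitted, as is common in practice.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1D array."""
    ez = np.exp(z - np.max(z))
    return ez / np.sum(ez)

def discounted_returns(rewards, gamma):
    """Return G[t] = sum_k gamma**k * rewards[t + k] for every step t."""
    G = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + gamma * acc
        G[t] = acc
    return G

def reinforce_update(W, observations, actions, rewards, gamma, lr=0.01):
    """One gradient ascent step for pi(a|x) = softmax(W @ x) (hypothetical helper)."""
    G = discounted_returns(rewards, gamma)
    grad = np.zeros_like(W)
    for x, a, g in zip(observations, actions, G):
        p = softmax(W @ x)
        # gradient of log pi(a|x) with respect to W: (onehot(a) - p) outer x
        dlog = -np.outer(p, x)
        dlog[a] += x
        grad += g * dlog
    return W + lr * grad

# toy usage: a single step with positive reward makes the taken action more likely
x = np.array([1.0, 0.0, 0.0])
W_new = reinforce_update(np.zeros((2, 3)), [x], [0], [1.0], gamma=0.99)
print(softmax(W_new @ x))
```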

In [3]:
# environment
from env import MazeGhostEnv

In [4]:
# iterative algorithms for reference calculation
import mdp

In [5]:
# show description of maze (grid world) geometry
with open('maze_geometry2.txt', 'r') as file:
    for line in file:
        print(line, end='')

# S: start, X: inaccessible, E: exit with reward +1, F: exit with reward -1
..XE
.X..
.X.F
S...


In [6]:
# "reward" on regular fields
r = -0.04

In [7]:
# define environment
e = MazeGhostEnv('maze_geometry2.txt', r)

In [8]:
# discount factor
gamma = 0.99

In [9]:
# policy neural network
net = PolicyNet(e.observation(0).size, e.num_actions)

In [10]:
print('starting policy gradient iteration...')
net = policy_gradient_iteration(net, e, gamma)

starting policy gradient iteration...
episode 1000 completed, nsteps: 32, total discounted reward: -2.53539, running mean: -1.73258
episode 2000 completed, nsteps: 3, total discounted reward: -2.0398, running mean: -1.56868
episode 3000 completed, nsteps: 14, total discounted reward: -2.24496, running mean: -1.42507
episode 4000 completed, nsteps: 4, total discounted reward: -1.0891, running mean: -1.33853
episode 5000 completed, nsteps: 1, total discounted reward: -1, running mean: -1.24636
episode 6000 completed, nsteps: 3, total discounted reward: -1.0597, running mean: -1.17593
episode 7000 completed, nsteps: 4, total discounted reward: -2.0594, running mean: -1.08287
episode 8000 completed, nsteps: 2, total discounted reward: -2.02, running mean: -1.01093
episode 9000 completed, nsteps: 14, total discounted reward: -2.24496, running mean: -0.947025
frames of episode 10000:
step 0
reward: -0.04
░ ░ █ E
░ █ ░ ░
░ █ G F
░ ░ ░ ☺
action: →
_____________
step 1
reward: -0.04
░ ░ █ E
░ █

episode 68000 completed, nsteps: 4, total discounted reward: -2.0594, running mean: -0.420423
episode 69000 completed, nsteps: 14, total discounted reward: 0.387605, running mean: -0.453154
frames of episode 70000:
step 0
reward: -0.04
░ ░ █ E
░ █ ☺ ░
░ █ ░ F
░ ░ ░ G
action: →
_____________
step 1
reward: -0.04
░ ░ █ E
░ █ ░ ☺
░ █ ░ F
░ ░ G ░
action: ↑
_____________
step 2
reward: -0.04
░ ░ █ E
░ █ ☺ ░
░ █ ░ F
░ G ░ ░
action: →
_____________
step 3
reward: -0.04
░ ░ █ E
░ █ ░ ☺
░ █ ░ F
░ ░ G ░
action: ↑
_____________
step 4
reward: 1.0
░ ░ █ ☺
░ █ ░ ░
░ █ ░ F
░ ░ ░ G
action: ↑
_____________
Game over!
episode 70000 completed, nsteps: 5, total discounted reward: 0.80298, running mean: -0.389744
episode 71000 completed, nsteps: 1, total discounted reward: -1, running mean: -0.386064
episode 72000 completed, nsteps: 15, total discounted reward: -2.26251, running mean: -0.445504
episode 73000 completed, nsteps: 1, total discounted reward: 1, running mean: -0.421882
episode 74000 completed,
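
As a sanity check on the log, the total discounted reward of episode 70000 can be recomputed from the per-step rewards printed in its frames. This assumes the total is defined as the sum of `gamma**t * r_t` over the steps with `gamma = 0.99`; the definition used inside `pg` is not shown.

```python
# rewards printed for steps 0..4 of episode 70000
rewards = [-0.04, -0.04, -0.04, -0.04, 1.0]
gamma = 0.99

# discounted sum over the episode
total = sum(gamma**t * r for t, r in enumerate(rewards))
print(round(total, 5))  # 0.80298
```

The result agrees with the value 0.80298 reported in the log line for episode 70000.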

In [11]:
# obtain policy from network: most likely action for each state
pol = np.zeros(e.num_states, dtype=int)
for s in range(e.num_states):
    x = e.observation(s).reshape(-1)
    aprob = net.evaluate(x[None, :])[0]
    pol[s] = np.argmax(aprob)

In [12]:
# reference optimal policy
pref = mdp.policy_iteration(e.tprob, e.rewards, gamma)
# corresponding value function
uref = mdp.utility_from_policy(e.tprob, e.rewards, gamma, pref)
# omit the final "game over" state from the average
umean = np.mean(uref[:-1])
print('optimal value function average:', umean)

policy iteration completed after 4 iterations
optimal value function average: -0.28615611539027547
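
The `mdp` module is not shown either. A minimal policy iteration sketch on a toy problem is given below, assuming a transition tensor `P[a, s, s_next]` and state-based rewards `R[s]` with utility `U(s) = R(s) + gamma * sum_s' P U(s')`. These conventions and the toy MDP are assumptions for illustration and need not match the layout of `e.tprob` and `e.rewards`.

```python
import numpy as np

def utility_from_policy(P, R, gamma, pol):
    """Solve U = R + gamma * P_pol @ U for a fixed policy (assumed conventions)."""
    n = len(R)
    P_pol = P[pol, np.arange(n), :]      # row s equals P[pol[s], s, :]
    return np.linalg.solve(np.eye(n) - gamma * P_pol, R)

def policy_iteration(P, R, gamma):
    """Alternate policy evaluation and greedy improvement until stable."""
    n = len(R)
    idx = np.arange(n)
    pol = np.zeros(n, dtype=int)
    while True:
        U = utility_from_policy(P, R, gamma, pol)
        Q = P @ U                        # expected next utility, shape (actions, n)
        best = np.argmax(Q, axis=0)
        # only switch actions on strict improvement to avoid cycling on ties
        improved = Q[best, idx] > Q[pol, idx] + 1e-12
        if not improved.any():
            return pol, U
        pol = np.where(improved, best, pol)

# toy MDP: state 0 is a regular field, state 1 an exit, state 2 "game over"
P = np.array([[[1, 0, 0], [0, 0, 1], [0, 0, 1]],    # action 0: stay in state 0
              [[0, 1, 0], [0, 0, 1], [0, 0, 1]]],   # action 1: move to the exit
             dtype=float)
R = np.array([-0.04, 1.0, 0.0])
pol_toy, U_toy = policy_iteration(P, R, gamma=0.99)
print(pol_toy, U_toy)
```

In this toy problem the optimal action in state 0 is to move to the exit, which gives `U(0) = -0.04 + 0.99 * 1 = 0.95`.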


In [13]:
# compare policy with reference
print('policy (most likely action, for all possible ghost locations):')
print(e.draw_policy(pol))
print('number of deviations from reference:', np.sum((pol - pref) * e.pmask != 0))

policy (most likely action, for all possible ghost locations):
↓ ← █ E
↓ █ → ↑
↓ █ ↑ F
G → ↑ ←

↓ ← █ E
↓ █ → ↑
↓ █ ↑ F
→ G ↑ ←

↓ ← █ E
↓ █ → ↑
↓ █ ↑ F
→ → G ←

↓ ↓ █ E
↓ █ → ↑
↓ █ ↑ F
→ → ↑ G

← ← █ E
← █ → ↑
G █ ↑ F
→ → ↑ ←

↓ ← █ E
↓ █ → ↑
↓ █ G F
→ → ↑ ←

↓ ← █ E
↓ █ → ↑
↓ █ ↑ G
→ → ↑ ←

↓ ← █ E
G █ → ↑
↓ █ ↑ F
→ → ↑ ←

↓ ← █ E
↓ █ G ↑
↓ █ ↑ F
→ → ↑ ←

↓ ← █ E
↓ █ → G
↓ █ ← F
→ → ↑ ←

G ← █ E
↓ █ → ↑
↓ █ ↑ F
→ → ↑ ←

↓ G █ E
↓ █ → ↑
↓ █ ↑ F
→ → ↑ ←

↓ ← █ G
↓ █ → ↑
↓ █ ↑ F
→ → ↑ ←
number of deviations from reference: 16
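
The deviation count above only registers states where the mask is nonzero. The toy arrays below are made up for illustration; `pmask` merely stands in for `e.pmask`, whose exact definition is not shown here.

```python
import numpy as np

pol   = np.array([0, 1, 2, 3, 0])
pref  = np.array([0, 2, 2, 1, 3])
pmask = np.array([1, 1, 1, 1, 0])   # last state excluded from the comparison

# multiplication binds tighter than !=, so this tests ((pol - pref) * pmask) != 0
deviations = np.sum((pol - pref) * pmask != 0)
print(deviations)  # 2
```

The difference in the last state is ignored because its mask entry is zero, so only the two unmasked disagreements are counted.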
