This code is for exercise 2 of chapter 4. We are asked to add a new state, 15, to the gridworld given in example 4.1 and work out the value of that state under the equiprobable random policy.

We first need to set up the environment. I'll write code for all the functions required, even when it seems redundant, just to mirror the pseudocode for iterative policy evaluation as closely as possible.  

In [18]:
nonterminal_states = list(range(1, 16)) # states 1 to 14 plus the new state 15
terminal_states = [-1] # both terminal corners are represented by the single state -1
rewards = [-1] # The reward for every transition is -1

# These are the actions. Their number represents the change
# in the state number if a state transition occurs 
# (e.g. going down from 1 to 5 is a change of 4)
# This makes it easier to define state transitions
LEFT = -1
RIGHT = 1
UP = -4
DOWN = 4

actions = [LEFT, RIGHT, UP, DOWN]

# Returns the probability of taking an action in a state
def policy(action, state):
    if action not in actions or state not in nonterminal_states:
        raise ValueError("invalid action or state")
    return 0.25 # equiprobable random policy: all four actions are equally likely

# Returns the successor state for a given action and state
def new_state(action, state):
    if action not in actions or state not in nonterminal_states:
        raise ValueError("invalid action or state")
    # These four moves lead into the terminal state
    if (state, action) in [(1, LEFT), (4, UP), (11, DOWN), (14, RIGHT)]:
        return -1
    # The new state 15 sits just below 13. The exercise specifies that
    # LEFT, UP, RIGHT and DOWN from 15 lead to 12, 13, 14 and 15 respectively
    # (so UP leads to 13, not to 9 as it would from 13 itself).
    if state == 15:
        return {LEFT: 12, UP: 13, RIGHT: 14, DOWN: 15}[action]
    # Second part of the exercise: DOWN from 13 leads to the new state 15
    if action == DOWN and state == 13:
        return 15
    # We must account for all the cases where an action does not change the state:
    if action == UP and state in [1, 2, 3]:
        return state
    if action == RIGHT and state in [3, 7, 11]:
        return state
    if action == DOWN and state in [12, 14]:
        return state
    if action == LEFT and state in [4, 8, 12]:
        return state
    return state + action # For all other cases, just perform a regular state transition
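
The enumerated special cases can also be derived from grid coordinates. The sketch below is an alternative formulation, not the code used in this notebook. The names `step` and `TERMINAL` are introduced here for illustration, and it assumes states 1 to 14 occupy row-major cells 1 to 14 of a 4x4 grid whose corner cells 0 and 15 are terminal.

```python
TERMINAL = -1
LEFT, RIGHT, UP, DOWN = -1, 1, -4, 4
ACTIONS = [LEFT, RIGHT, UP, DOWN]

def step(state, action):
    """Successor state, assuming the exercise's dynamics with DOWN from 13 leading to 15."""
    # The added state 15 lies outside the 4x4 grid, so handle it first.
    # (It must not be confused with grid cell 15, the terminal corner.)
    if state == 15:
        return {LEFT: 12, UP: 13, RIGHT: 14, DOWN: 15}[action]
    if state == 13 and action == DOWN:
        return 15
    row, col = divmod(state, 4)
    if action == LEFT:
        col -= 1
    elif action == RIGHT:
        col += 1
    elif action == UP:
        row -= 1
    else:  # DOWN
        row += 1
    # Moving off the grid leaves the state unchanged
    if not (0 <= row < 4 and 0 <= col < 4):
        return state
    cell = 4 * row + col
    # Grid cells 0 and 15 are the two terminal corners
    return TERMINAL if cell in (0, 15) else cell

print([step(15, a) for a in ACTIONS])  # [12, 14, 13, 15]
```

Deriving the wall and terminal cases from coordinates avoids maintaining four hand-written lists, at the cost of being less close to a literal lookup of the figure.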

We now implement the policy evaluation function. We begin by setting up a mapping from states to values. 

In [19]:
def evaluate(policy, accuracy, iterations = None):
    # Mapping from states to values. Initialised at 0 for all states.
    value = dict()
    for state in nonterminal_states + terminal_states:
        value[state] = 0

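    difference = accuracy # ensures the loop body runs at least once
    i = 0
    # Sweep until the largest change in a sweep drops below `accuracy`,
    # or until the optional cap on the number of sweeps is reached
    while difference >= accuracy and (iterations is None or i < iterations):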
        difference = 0
        for s in nonterminal_states:
            s_value = value[s]
            # In-place update: for each action, take the reward (-1) plus the current
            # value of the successor state, weighted by the policy's probability
            value[s] = sum(policy(a, s) * (-1 + value[new_state(a, s)]) for a in actions)
            difference = max(difference, abs(s_value - value[s]))
        i += 1

    return value
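
For reference, the update applied to each nonterminal state during a sweep can be written as below. It assumes, as the code does, deterministic transitions, a reward of -1 on every step and no discounting. The symbols $\pi$, $v$ and $s'(s, a)$ are notation introduced here; they correspond to `policy`, `value` and `new_state`.

```latex
v(s) \leftarrow \sum_{a} \pi(a \mid s)\,\bigl[-1 + v\bigl(s'(s, a)\bigr)\bigr],
\qquad \pi(a \mid s) = \tfrac{1}{4}
```

Because `value[s]` is overwritten as soon as it is computed, states later in the sweep already see the updated values of earlier states.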


Let's now run an evaluation:

In [27]:
values = evaluate(policy, 0.001)
for s in values:
    print(s, round(values[s], 1))

1 -14.0
2 -20.0
3 -22.0
4 -14.0
5 -18.0
6 -20.0
7 -20.0
8 -20.0
9 -20.0
10 -18.0
11 -14.0
12 -22.0
13 -20.0
14 -14.0
15 -20.0
-1 0
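
As a hand check on the value printed for state 15, the Bellman equation for that state can be solved directly. This assumes the exercise's dynamics (LEFT, UP, RIGHT and DOWN from 15 lead to 12, 13, 14 and 15) and plugs in the values printed above for states 12, 13 and 14.

```latex
\begin{aligned}
v(15) &= -1 + \tfrac{1}{4}\bigl[v(12) + v(13) + v(14) + v(15)\bigr] \\
      &= -1 + \tfrac{1}{4}\bigl[-22 - 20 - 14 + v(15)\bigr] \\
\tfrac{3}{4}\,v(15) &= -15 \quad\Longrightarrow\quad v(15) = -20
\end{aligned}
```

This agrees with the printed value. Substituting $v(15) = -20$ into the equation for state 13 gives $-1 + \tfrac{1}{4}(-22 - 20 - 14 - 20) = -20$, so redirecting DOWN from 13 to 15 leaves $v(13)$ unchanged.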
