# Predicting Rewards with the Action Value Function

This experiment is very similar to the one about the state value function. I recommend that you read that first. 

The action-value function gives the expected return for a given state and action choice. The action adds an extra dimension to the state-value function. The premise is the same, but this time you need to iterate over all actions as well as all states. The equation is also similar, with the addition of an action, $a$:

$$ Q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[ G \vert s, a ] = \mathbb{E}_{\pi}\bigg[ \sum^{T}_{k=0} \gamma^k r_{k} \vert s, a \bigg] $$
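
To make the sum inside the expectation concrete, here is a minimal sketch that computes the return $G$ for a single episode. The helper name `discounted_return` and the reward sequence are hypothetical, chosen only for illustration.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**k * r_k over one episode's reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Hypothetical episode: three steps with no reward, then a goal reward of 5.
episode_rewards = [0, 0, 0, 5]

undiscounted = discounted_return(episode_rewards)           # gamma = 1
discounted = discounted_return(episode_rewards, gamma=0.9)  # 0.9**3 * 5
```

The action-value function is the average of this quantity over many episodes that pass through a given state and action. The experiment in this notebook credits the final reward to every logged pair without discounting, which corresponds to $\gamma = 1$.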

Let's rerun the state-value function experiment to see what changes.

## The Environment: A Simple Grid World

The environment code is exactly the same as before.

In [1]:
starting_position = 1 # The starting position
cliff_position = 0 # The cliff position
end_position = 5 # The terminating state position
reward_goal_state = 5 # Reward for reaching goal
reward_cliff = 0 # Reward for falling off cliff

def reward(current_position) -> int:
    if current_position <= cliff_position:
        return reward_cliff
    if current_position >= end_position:
        return reward_goal_state
    return 0

def is_terminating(current_position) -> bool:
    if current_position <= cliff_position:
        return True
    if current_position >= end_position:
        return True
    return False

## The Agent

The agent is also exactly the same.

In [2]:
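import numpy as np  # strategy() relies on NumPy's random number generator

def strategy() -> int:
    # Random policy: move right or left with equal probability
    if np.random.random() >= 0.5:
        return 1 # Right
    else:
        return -1 # Left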

## The Experiment

However, here's where it differs. First off, there's far more exploration to do, because we're not only iterating over states, but also actions. You'll need to run this for longer before it converges.

Also, we're going to have to store both the states and the actions in the buffer.

In [4]:
import numpy as np
np.random.seed(42)

# Global buffers to perform averaging later
# Second dimension is the actions
value_sum = np.zeros((end_position + 1, 2))
n_hits = np.zeros((end_position + 1, 2))

# A helper function to map the actions (-1, 1) to valid buffer indices (0, 1)
def action_value_mapping(x):
    return 0 if x == -1 else 1


n_iter = 10001
for i in range(n_iter):
    position_history = [] # A log of (position, action) pairs in this episode
    current_position = starting_position # Reset
    # Note: the action is sampled once per episode and logged with every position;
    # the moves below come from separate strategy() calls
    current_action = strategy()
    while True:
        # Append (position, action) to log
        position_history.append((current_position, current_action))

        if is_terminating(current_position):
            break

        # Update current position according to strategy
        current_position += strategy()

    # Now the episode has finished, what was the reward?
    current_reward = reward(current_position)
    
    # Now add the reward to the buffers that allow you to calculate the average
    for pos, act in position_history:
        value_sum[pos, action_value_mapping(act)] += current_reward
        n_hits[pos, action_value_mapping(act)] += 1
        
    # Every 1000 episodes, calculate the running averages over all episodes so far
    # and print them (pairs that have not been visited yet show as nan)
    if i % 1000 == 0:
        expect_return_0 = ', '.join(
            f'{q:.2f}' for q in value_sum[:, 0] / n_hits[:, 0])
        expect_return_1 = ', '.join(
            f'{q:.2f}' for q in value_sum[:, 1] / n_hits[:, 1])
        print("[{}] Average reward: [{} ; {}]".format(i, expect_return_0, expect_return_1))



[0] Average reward: [nan, 5.00, 5.00, 5.00, 5.00, 5.00 ; nan, nan, nan, nan, nan, nan]
[1000] Average reward: [0.00, 1.15, 2.30, 3.21, 4.03, 5.00 ; 0.00, 0.81, 1.70, 2.79, 4.01, 5.00]
[2000] Average reward: [0.00, 1.07, 2.10, 2.97, 3.81, 5.00 ; 0.00, 0.75, 1.55, 2.50, 3.62, 5.00]
[3000] Average reward: [0.00, 1.07, 2.10, 3.02, 3.93, 5.00 ; 0.00, 0.79, 1.60, 2.49, 3.59, 5.00]
[4000] Average reward: [0.00, 1.04, 2.05, 2.96, 3.88, 5.00 ; 0.00, 0.93, 1.82, 2.76, 3.80, 5.00]
[5000] Average reward: [0.00, 1.04, 2.06, 3.03, 3.98, 5.00 ; 0.00, 0.97, 1.88, 2.84, 3.84, 5.00]
[6000] Average reward: [0.00, 1.04, 2.06, 3.01, 3.97, 5.00 ; 0.00, 0.97, 1.90, 2.86, 3.84, 5.00]
[7000] Average reward: [0.00, 1.06, 2.08, 3.04, 4.02, 5.00 ; 0.00, 0.97, 1.90, 2.85, 3.84, 5.00]
[8000] Average reward: [0.00, 1.05, 2.08, 3.03, 3.98, 5.00 ; 0.00, 0.96, 1.91, 2.89, 3.88, 5.00]
[9000] Average reward: [0.00, 1.03, 2.03, 2.99, 3.96, 5.00 ; 0.00, 0.95, 1.89, 2.89, 3.89, 5.00]
[10000] Average reward: [0.00, 1.00, 1.9
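
In the loop above, `current_action` is sampled once at the start of each episode, while every move comes from a separate `strategy()` call. The logged action is therefore independent of the move actually made, which would explain why both rows of the output settle near the same values. The following is a minimal sketch, not the original experiment, of one way to log the action that is actually taken. The names `run_episode`, `q_sum`, `q_hits` and `q_estimate`, the episode count, and the use of `np.random.default_rng` are assumptions made for illustration. Terminal states are not logged because no action is taken there.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumption: a local generator, not the global seed

CLIFF, GOAL, START = 0, 5, 1     # same layout as the grid world above
GOAL_REWARD = 5

def run_episode():
    """Return the (state, action) pairs visited and the final reward."""
    position, history = START, []
    while CLIFF < position < GOAL:
        action = 1 if rng.random() >= 0.5 else -1  # sample the move...
        history.append((position, action))          # ...and log that same move
        position += action
    return history, (GOAL_REWARD if position >= GOAL else 0)

q_sum = np.zeros((GOAL + 1, 2))
q_hits = np.zeros((GOAL + 1, 2))
for _ in range(20000):
    history, final_reward = run_episode()
    for pos, act in history:
        idx = 0 if act == -1 else 1  # column 0 is left, column 1 is right
        q_sum[pos, idx] += final_reward
        q_hits[pos, idx] += 1

# Only the non-terminal states 1..4 are ever logged, so divide just those rows.
q_estimate = q_sum[1:GOAL] / q_hits[1:GOAL]
```

For this undiscounted 50/50 random walk the exact state values are $V(s) = s$, so the estimates in `q_estimate` should approach $Q(s, \text{left}) = s - 1$ and $Q(s, \text{right}) = s + 1$ for $s = 1, \dots, 4$.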

In [5]:
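# Plain Python lists concatenate with + and cannot be divided by a number,
# so use NumPy arrays for the element-wise average of the two action rows
(np.array([0.00, 1.00, 1.98, 2.93, 3.93, 5.00]) + np.array([0.00, 0.96, 1.90, 2.90, 3.91, 5.00])) / 2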
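
Averaging the two action rows is a way to get back to the state-value estimates. Under a policy $\pi$, the state value is the policy-weighted sum of the action values, and for the 50/50 strategy used here that weighting reduces to a plain mean:

$$ V_{\pi}(s) = \sum_{a} \pi(a \vert s) \, Q_{\pi}(s, a) = \tfrac{1}{2} Q_{\pi}(s, \text{left}) + \tfrac{1}{2} Q_{\pi}(s, \text{right}) $$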