# TD Prediction

In this demo we estimate the value function of a given policy using the temporal-difference (TD) method.

In [1]:
import numpy as np
from gridworld import GridWorld

np.set_printoptions(precision=3,suppress=True)

At state s_t the agent takes an action based on the current observation, receives a reward, and moves to the next state s_t+1, which yields a new_observation. Using both observations, we look up the current value estimates and update the estimate for state s_t.
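
The step described here corresponds to the standard TD(0) update rule. The symbols are chosen for illustration to match the code in the next cell: V is the value estimate (`value_matrix`), r_{t+1} is the reward received on the transition, α is the step size (`alpha`), and γ is the discount factor (`gamma`).

$$
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]
$$

The bracketed term is usually called the TD error: the gap between the one-step target r_{t+1} + γ V(s_{t+1}) and the current estimate V(s_t).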

In [2]:
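def update_value_matrix(value_matrix, observation, new_observation,
                        reward, alpha, gamma):
    """Apply one TD(0) update to the value of the state at `observation`."""
    #Current estimates for the visited state and its successor
    u = value_matrix[observation[0], observation[1]]
    u_t1 = value_matrix[new_observation[0], new_observation[1]]
    #Move the estimate toward the target (reward + gamma * u_t1) by step size alpha
    value_matrix[observation[0], observation[1]] += \
        alpha * (reward + gamma * u_t1 - u)
    return value_matrix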
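
To make the arithmetic concrete, here is a minimal sketch of a single update on an all-zero value table. The helper `td0_update` and the chosen transition are hypothetical and only mirror the function above; they are not part of the `gridworld` module.

```python
import numpy as np

def td0_update(values, state, next_state, reward, alpha, gamma):
    """One TD(0) step: nudge values[state] toward reward + gamma * values[next_state]."""
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return values

values = np.zeros((3, 4))
# Hypothetical transition: from cell (0, 0) to cell (0, 1) with step reward -0.04
values = td0_update(values, (0, 0), (0, 1), reward=-0.04, alpha=0.1, gamma=0.999)
print(values[0, 0])
```

With every estimate starting at zero, the TD error equals the reward, so the updated entry is about 0.1 * (-0.04) = -0.004. This is the same value that appears in the first printed value matrix further down.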

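Let's create a gridworld of size (3,4). In the state matrix, terminal states are marked 1 and ordinary states 0; the cell at (1,1) is marked -1 (an obstacle, which has no action in the policy). The agent has four possible actions (UP, DOWN, LEFT, RIGHT) and receives a reward of -0.04 in every non-terminal state. One terminal state has a reward of +1 and the other -1 (not favored). The value function gives the expected discounted return from each state under the policy, so with a cost of -0.04 per step, states farther from the +1 terminal state tend to have lower values.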
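
As a rough illustration of what a state value measures, the sketch below computes the discounted return of one hypothetical episode. The helper name `discounted_return` and the reward sequence are assumptions made for this example; the value of a state is the expectation of this quantity over the episodes that start there.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by an increasing power of gamma."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Hypothetical episode: two ordinary steps, then the +1 terminal state
rewards = [-0.04, -0.04, 1.0]
print(discounted_return(rewards, gamma=0.999))
```

Each extra step before the +1 terminal state adds another -0.04 and discounts the final reward a little more, which is why the return shrinks as the path gets longer.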

In [3]:
env = GridWorld(3, 4)

#Define the state matrix
state_matrix = np.zeros((3,4))
state_matrix[0, 3] = 1
state_matrix[1, 3] = 1
state_matrix[1, 1] = -1
print("State Matrix:")
print(state_matrix)

State Matrix:
[[ 0.  0.  0.  1.]
 [ 0. -1.  0.  1.]
 [ 0.  0.  0.  0.]]


In [4]:
#Define the reward matrix
reward_matrix = np.full((3,4), -0.04)
reward_matrix[0, 3] = 1
reward_matrix[1, 3] = -1
print("Reward Matrix:")
print(reward_matrix)

Reward Matrix:
[[-0.04 -0.04 -0.04  1.  ]
 [-0.04 -0.04 -0.04 -1.  ]
 [-0.04 -0.04 -0.04 -0.04]]


In [5]:
#Define the transition matrix
transition_matrix = np.array([[0.8, 0.1, 0.0, 0.1],
                              [0.1, 0.8, 0.1, 0.0],
                              [0.0, 0.1, 0.8, 0.1],
                              [0.1, 0.0, 0.1, 0.8]])

#Define the policy matrix
#This is the optimal policy for world with reward=-0.04
policy_matrix = np.array([[1,      1,  1,  -1],
                          [0, np.nan,  0,  -1],
                          [0,      3,  3,   3]])
print("Policy Matrix:")
print(policy_matrix)

env.setStateMatrix(state_matrix)
env.setRewardMatrix(reward_matrix)
env.setTransitionMatrix(transition_matrix)

value_matrix = np.zeros((3,4))
gamma = 0.999
alpha = 0.1 #constant step size
tot_epoch = 30000
print_epoch = 1000

for epoch in range(tot_epoch):
    #Reset and return the first observation
    observation = env.reset(exploring_starts=True)
    for step in range(1000):
        #Take the action from the policy matrix
        action = policy_matrix[observation[0], observation[1]]
        #Move one step in the environment and get obs and reward
        new_observation, reward, done = env.step(action)
        value_matrix = update_value_matrix(value_matrix, observation, 
                                        new_observation, reward, alpha, gamma)
        observation = new_observation
        
        if done: break

    if(epoch % print_epoch == 0):
        print("")
        print("Value matrix after " + str(epoch+1) + " iterations:") 
        print(value_matrix)
#Time to check the value matrix obtained
print("Value matrix after " + str(tot_epoch) + " iterations:")
print(value_matrix)

Policy Matrix:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]

Value matrix after 1 iterations:
[[-0.004 -0.004  0.096  0.   ]
 [-0.004  0.     0.     0.   ]
 [-0.004 -0.004 -0.004  0.   ]]

Value matrix after 1001 iterations:
[[0.872 0.927 0.916 0.   ]
 [0.825 0.    0.451 0.   ]
 [0.771 0.717 0.663 0.491]]

Value matrix after 2001 iterations:
[[0.853 0.914 0.978 0.   ]
 [0.79  0.    0.859 0.   ]
 [0.733 0.687 0.69  0.43 ]]

Value matrix after 3001 iterations:
[[0.881 0.932 0.982 0.   ]
 [0.825 0.    0.5   0.   ]
 [0.762 0.705 0.651 0.579]]

Value matrix after 4001 iterations:
[[0.869 0.926 0.964 0.   ]
 [0.819 0.    0.702 0.   ]
 [0.757 0.691 0.642 0.364]]

Value matrix after 5001 iterations:
[[0.89  0.938 0.994 0.   ]
 [0.834 0.    0.93  0.   ]
 [0.783 0.721 0.66  0.207]]

Value matrix after 6001 iterations:
[[0.861 0.926 0.992 0.   ]
 [0.794 0.    0.391 0.   ]
 [0.734 0.696 0.623 0.327]]

Value matrix after 7001 iterations:
[[0.852 0.885 0.951 0.   ]
 [0.809 0.    0.784 0