## Policy Evaluation by Direct Solution of Bellman Equation.

Let us consider an Agent that has to navigate a Grid World with 42 states as shown below. The Agent starts from the bottom left and has to reach the top right corner without stepping on the cells with negative reward. If the Agent falls into the shaded state it will bounce back to the previous state.<br>

![title](Grid_World.png)

 At any moment, Agent can take four actions - Up(^), Left(<), Down(V),right(>). <br> 


In [1]:
import numpy as np
T = np.load("./T.npy")

Let us assume that a Transition Model is given to the System. Below, we examine the Transition Probabilities of some states.<br>
 
 From the Terminal State (0,6), it does not move anywhere.

In [2]:
T.shape

(42, 42, 4)

Transition Probability of Terminal State is made to be zero for all actions.

 You can examine that the Agent moves to the intended state with 0.8 Probability and moves to random direction with 0.1 probability. <br>

In [3]:
def return_policy_eval_BellMan(p, r, T, gamma):
    "Solving the Bellman Equation directly"
    x = np.zeros(42)
    for s in range(42):
        if not np.isnan(p[s]):
            action = int(p[s])
            x[s] = np.linalg.solve(np.identity(42) - gamma*T[:,:,action], r)[s]
    return x

In [4]:
gamma = 0.999
iteration_it = 0

#Generate the first policy randomly
# Nan=Nothing, -1=Terminal, 0=Up, 1=Left, 2=Down, 3=Right
p = np.random.randint(0, 4, size=(42)).astype(np.float32)
p[5]=p[9]=p[21]=p[24]=p[37]=p[39] = np.NaN
p[6]=p[4]=p[17]=p[20]=p[26] = -1 #terminal states

#Value function initialised to zero
v = np.array([0.0, 0.0, 0.0,  0.0,0.0, 0.0, 0.0,
              0.0, 0.0, 0.0,  0.0,0.0, 0.0, 0.0,
              0.0, 0.0, 0.0,  0.0,0.0, 0.0, 0.0,
              0.0, 0.0, 0.0,  0.0,0.0, 0.0, 0.0,
              0.0, 0.0, 0.0,  0.0,0.0, 0.0, 0.0,
              0.0, 0.0, 0.0,  0.0,0.0, 0.0, 0.0])

#let us assign appropriate rewards
r = np.array([-0.04, -0.04, -0.04,  -0.04,  -1.0,   0.0,  +1.0,
              -0.04, -0.04,   0.0,  -0.04, -0.04, -0.04, -0.04,
              -0.04, -0.04, -0.04,   -1.0, -0.04, -0.04,  -1.0,
                0.0, -0.04, -0.04,    0.0, -0.04,  -1.0, -0.04,
              -0.04, -0.04, -0.04,  -0.04, -0.04, -0.04, -0.04,
              -0.04, -0.04,   0.0,  -0.04,   0.0, -0.04, -0.04])
unchanged = False
# while True:
for it in range(0,100):
    iteration_it +=1
    epsilon = 0.0001
    #1- Policy Evaluation
    v1 = v.copy()
    #Direct solution
    v = return_policy_eval_BellMan(p, r, T, gamma)

print("Iterations: " + str(iteration_it))
print("Gamma: " + str(gamma))
print("Epsilon: " + str(epsilon))
print("===================================================")
print("Estimated value function")
print(v.reshape(6,7))

Iterations: 100
Gamma: 0.999
Epsilon: 0.0001
Estimated value function
[[-16.57198381  -0.17304185  -0.12320832  -0.04        -0.28438129
    0.           1.10623233]
 [ -0.57844257  -0.04         0.          -0.04        -0.31388463
   -0.10832852  -0.59315493]
 [  0.2068669    0.30117132  -3.45777021  -2.99670873  -3.45316587
   -0.22855119  -1.13471744]
 [  0.          -8.70030528   2.18971829   0.          -3.12559507
   -2.22931184  -1.32977286]
 [  0.52564885  -4.44974174  -0.0649005   -1.57358793  -6.03749128
   -0.54859993  -0.08469095]
 [ -0.04952512   0.06296128   0.          -4.06561861   0.
    0.04064972  -0.05441301]]


-2.32186561  -1.01791823  -0.12320832  -0.04       -0.28438129  0.0           1.10623233
-2.3689515   -0.04         0.0         -0.04       -0.31388463 -0.06857593   -0.04284566
-5.78604803  -1.99191225  -3.45777021  -2.99670873 -1.5905784  -1.3114059    -1.13471744
 0.0         -0.10334749   2.18971829   0.0         1.03341728 -2.22931184   -1.32977286
-4.54138151  -0.09296731   0.54903187   0.45820533 -0.06467763 -0.09113937   -0.08469095
-1.12512904   0.06296128   0.0          0.08982413  0.0        -0.14084028   -0.05760154