# Policy Evaluation (Prediction)

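Policy evaluation computes the state-value function $v_{\pi}$ for a given policy $\pi$ by building a sequence of approximations $v_0, v_1, v_2, \dots$. The initial approximation, $v_0$, is chosen arbitrarily (except that the terminal state, if any, must be given value 0), and each successive approximation is obtained by using the Bellman expectation equation for $v_{\pi}$ as an update rule: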

\begin{equation*}
v_{k+1}(s) \doteq \mathbb{E}_{\pi} [R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s] = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_k(s')]
\end{equation*}

for all $s \in S$. Clearly, $v_k = v_{\pi}$ is a fixed point for this update rule because the Bellman equation for $v_{\pi}$ assures us of equality in this case. Indeed, the sequence $\{v_k\}$ can be shown in general to converge to $v_{\pi}$ as $k \rightarrow \infty$ under the same conditions that guarantee the existence of $v_{\pi}$. This algorithm is called iterative policy evaluation.

Formally, iterative policy evaluation converges only in the limit, but in practice it must be halted short of this. As soon as $\Delta = \max_{s \in S}|v_{k+1}(s)-v_k(s)|$ is less than threshold $\theta$ the evaluation terminates.

- Input $\pi$, the policy to be evaluated <br>
- Algorithm parameter: a small threshold $\theta > 0$ determining accuracy of estimation <br>
- Initialize $V(s)$, for all $s \in S^+$, arbitrarily except that $V(\text{terminal}) = 0$ <br>
- Loop: <br>
    + $\Delta \leftarrow 0$
    + Loop for each $s \in S$:
        + $v \leftarrow V(s)$
        + $V(s) \leftarrow \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma V(s')]$
        + $\Delta \leftarrow \max(\Delta, |v-V(s)|)$
- Until $\Delta < \theta$
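The loop above can be sketched as a small generic function. This is only an illustration under stated assumptions: the dictionary layout for `policy` and `dynamics`, the function name, and the two-state chain used to exercise it are all hypothetical and do not come from the book.

```python
def iterative_policy_evaluation(states, policy, dynamics, gamma=1.0, theta=1e-8):
    """In-place iterative policy evaluation for a small tabular MDP.

    states   : iterable of non-terminal states
    policy   : policy[s] is a dict {action: probability}
    dynamics : dynamics[(s, a)] is a list of (prob, next_state, reward)
    Terminal states are absent from `states`, so their value stays 0.
    """
    V = {}
    while True:
        delta = 0.0
        for s in states:
            v_old = V.get(s, 0.0)
            # Expected one-step return under the policy, using current estimates
            V[s] = sum(
                pi_a * prob * (reward + gamma * V.get(s_next, 0.0))
                for a, pi_a in policy[s].items()
                for prob, s_next, reward in dynamics[(s, a)]
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V


# Hypothetical two-state chain: A -> B -> terminal "T", reward -1 per step.
states = ["A", "B"]
policy = {"A": {"go": 1.0}, "B": {"go": 1.0}}
dynamics = {("A", "go"): [(1.0, "B", -1.0)], ("B", "go"): [(1.0, "T", -1.0)]}
V = iterative_policy_evaluation(states, policy, dynamics)
print(V)
```

Because the update is in place, each state already sees the new values of states visited earlier in the same sweep.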

### Example 4.1 Grid World

import numpy as np
np.set_printoptions(formatter={'float': lambda x: "{0:0.1f}".format(x)})

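# Moves as (row, column) offsets: left, right, down, up
actions = [(0, -1), (0, 1), (1, 0), (-1, 0)]
gamma = 1      # undiscounted episodic task
theta = 0.01   # stopping threshold on the largest change in a sweep
policy = 0.25  # equiprobable random policy: pi(a|s) = 0.25 for every action

def p(state, action):
    """Deterministic dynamics: return (next_state, reward) for a move."""
    next_state = (state[0] + action[0], state[1] + action[1])
    if 0 <= next_state[0] < 4 and 0 <= next_state[1] < 4:
        return next_state, -1
    return state, -1  # the move would leave the grid: stay in place

v = np.zeros((4, 4))  # the terminal corners keep value 0
while True:
    delta = 0
    for i in range(4):
        for j in range(4):
            if (i, j) in ((0, 0), (3, 3)):
                continue  # skip the terminal states
            v_prev = v[i, j]
            tmp = 0
            for a in actions:
                n_s, r = p((i, j), a)
                tmp += policy * (r + gamma * v[n_s[0], n_s[1]])
            v[i, j] = tmp  # in-place update
            delta = max(delta, abs(v[i, j] - v_prev))
    if delta < theta:
        break
print(np.round(v))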

[[0.0 -14.0 -20.0 -22.0]
 [-14.0 -18.0 -20.0 -20.0]
 [-20.0 -20.0 -18.0 -14.0]
 [-22.0 -20.0 -14.0 0.0]]
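The cell above updates `v` in place, whereas the update rule for $v_{k+1}$ is written in terms of the previous approximation $v_k$ only. A minimal sketch of that two-array (synchronous) variant follows; the helper names `step` and `synchronous_sweep` are illustrative and not part of the original notebook.

```python
import numpy as np

ACTIONS = [(0, -1), (0, 1), (1, 0), (-1, 0)]  # left, right, down, up
TERMINALS = {(0, 0), (3, 3)}

def step(state, action):
    """Next cell on the 4x4 grid; off-grid moves leave the state unchanged."""
    i, j = state[0] + action[0], state[1] + action[1]
    if 0 <= i < 4 and 0 <= j < 4:
        return (i, j)
    return state

def synchronous_sweep(v_k, gamma=1.0):
    """Return v_{k+1}, computed entirely from v_k (two-array update)."""
    v_next = np.zeros_like(v_k)
    for i in range(4):
        for j in range(4):
            if (i, j) in TERMINALS:
                continue  # terminal value stays 0
            v_next[i, j] = sum(
                0.25 * (-1 + gamma * v_k[step((i, j), a)]) for a in ACTIONS
            )
    return v_next

v0 = np.zeros((4, 4))
v1 = synchronous_sweep(v0)
v2 = synchronous_sweep(v1)
print(v2)
```

Both variants converge to the same $v_{\pi}$; the in-place version usually needs fewer sweeps because it uses new values as soon as they are available.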


#### Exercise 4.1 In Example 4.1, if $\pi$ is the equiprobable random policy, what is $q_{\pi}(11,down)$? What is $q_{\pi}(7,down)$?

Answer: the reward is $-1$ on every transition, and this is an undiscounted episodic task ($\gamma = 1$). The values of $v_{\pi}$ used below are the converged values $v_{\infty}$ from figure 4.2 in the book.

\begin{align*}
q_{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a)[r + \gamma v_{\pi}(s')]
\end{align*}
\begin{align*}
q_{\pi}(11, \text{down}) = -1 + 1\times0 = -1
\end{align*}
\begin{align*}
q_{\pi}(7, \text{down}) = -1 + 1\times(-14) = -15
\end{align*}
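The two values can be checked with a one-step lookahead on the converged value table. This is a small sketch, assuming the book's row-major state numbering (state $n$ sits at row `n // 4`, column `n % 4`); `V_PI`, `MOVES` and `q_pi` are names introduced here for illustration.

```python
import numpy as np

# Converged values of v_pi for the 4x4 gridworld (rounded, as printed above).
V_PI = np.array([
    [0.0, -14.0, -20.0, -22.0],
    [-14.0, -18.0, -20.0, -20.0],
    [-20.0, -20.0, -18.0, -14.0],
    [-22.0, -20.0, -14.0, 0.0],
])
MOVES = {"left": (0, -1), "right": (0, 1), "down": (1, 0), "up": (-1, 0)}

def q_pi(state_number, move, gamma=1.0):
    """One-step lookahead q_pi(s, a) = r + gamma * v_pi(s') with r = -1."""
    i, j = divmod(state_number, 4)  # row-major cell index -> (row, column)
    di, dj = MOVES[move]
    ni, nj = i + di, j + dj
    if not (0 <= ni < 4 and 0 <= nj < 4):
        ni, nj = i, j  # off-grid moves leave the state unchanged
    return -1 + gamma * V_PI[ni, nj]

print(q_pi(11, "down"), q_pi(7, "down"))
```

Moving down from state 11 reaches the terminal state (value 0), and moving down from state 7 reaches state 11 (value $-14$), which gives $-1$ and $-15$.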

#### Exercise 4.2 In Example 4.1, suppose a new state 15 is added to the gridworld just below state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively. Assume that the transitions from the original states are unchanged. What, then, is $v_{\pi}(15)$ for the equiprobable random policy? Now suppose the dynamics of state 13 are also changed, such that action down from state 13 takes the agent to the new state 15. What is $v_{\pi}(15)$ for the equiprobable random policy in this case?

Answer:

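- Case 1: the transitions from the original states are unchanged. The neighbours keep their converged values $v_{\pi}(12) = -22$, $v_{\pi}(13) = -20$ and $v_{\pi}(14) = -14$, so only $v(15)$ needs to be iterated, starting from $v_0(15) = 0$: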

\begin{align*}
v_0(15) = 0
\end{align*}
\begin{align*}
v_{1}(15) = 0.25 \times [(-21) + (-23) + (-15) + (-1)] = -15.0
\end{align*}
\begin{align*}
v_{2}(15) = 0.25 \times [(-21) + (-23) + (-15) + (-16)] = -18.75
\end{align*}
\begin{align*}
\vdots
\end{align*}
\begin{align*}
v_{\infty}(15) = -20
\end{align*}
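Both cases of the exercise can be checked numerically by solving the Bellman equations as a linear system. The sketch below is not from the book: it assumes index 0 stands for the single terminal state (both shaded corners), index 15 for the new state, and the helper names `grid_next` and `solve_values` are hypothetical. The second case is not worked out in the text above, so treat the result as a check rather than an official answer.

```python
import numpy as np

MOVES = [(0, -1), (-1, 0), (0, 1), (1, 0)]  # left, up, right, down

def grid_next(s, move):
    """Successor of an original state 1..14; 0 stands for the terminal state."""
    i, j = divmod(s, 4)
    ni, nj = i + move[0], j + move[1]
    if not (0 <= ni < 4 and 0 <= nj < 4):
        return s  # off-grid move: stay put
    cell = 4 * ni + nj
    return 0 if cell in (0, 15) else cell  # both corners are terminal

def solve_values(reroute_13_down, gamma=1.0):
    successors = {s: [grid_next(s, m) for m in MOVES] for s in range(1, 15)}
    successors[15] = [12, 13, 14, 15]  # new state: left, up, right, down
    if reroute_13_down:
        successors[13][3] = 15  # second case: down from 13 leads to 15
    # Bellman equations: v(s) - 0.25 * gamma * sum_a v(s'_a) = -1, with v(0) = 0
    A = np.eye(16)
    b = np.zeros(16)
    for s, succ in successors.items():
        b[s] = -1.0
        for s_next in succ:
            A[s, s_next] -= 0.25 * gamma
    return np.linalg.solve(A, b)

v_case1 = solve_values(reroute_13_down=False)
v_case2 = solve_values(reroute_13_down=True)
print(round(v_case1[15], 6), round(v_case2[15], 6))
```

Under these assumptions both cases give $v_{\pi}(15) = -20$. In the second case this is consistent with $v_{\pi}(13) = -20$: moving down from 13 into a state with the same value as 13 itself leaves the equation for state 13 unchanged.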

#### Exercise 4.3 What are the equations analogous to (4.3), (4.4), and (4.5) for the action-value function $q_{\pi}$ and its successive approximation by a sequence of functions $q_0$, $q_1$, $q_2$, ...?

Answer:

\begin{align*}
q_{k+1}(s, a) = \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \sum_{a'} \pi(a'|s') q_{k}(s', a')\Big]
\end{align*}
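The update above is the analog of (4.5). For completeness, one way to write the analogs of (4.3) and (4.4), the Bellman expectation equation for $q_{\pi}$ in expectation form and in sum form, is sketched below. It follows from conditioning on the next state and then averaging over the next action chosen by $\pi$.

\begin{align*}
q_{\pi}(s, a) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] \\
&= \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a')\Big]
\end{align*}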

# Policy Improvement

Now that we know $v_{\pi}$, we can look for a better policy. Consider selecting, in some state $s$, an action $a$ that differs from the one our policy would choose, and thereafter following the existing policy $\pi$; call the resulting policy $\pi'$. If this choice is at least as good as following $\pi$ all the time, that is, if $q_{\pi}(s,\pi'(s)) \ge v_{\pi}(s)$ for all $s \in S$, then the new policy obtains greater or equal expected return from every state: $v_{\pi'}(s) \ge v_{\pi}(s)$ for all $s \in S$.
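A common way to build such a $\pi'$ is to act greedily with respect to $v_{\pi}$ using a one-step lookahead. The sketch below applies this to the gridworld of Example 4.1, assuming the converged values printed earlier; `V_PI`, `MOVES` and `greedy_actions` are illustrative names, not part of the original notebook.

```python
import numpy as np

# Converged values of v_pi for the random policy on the 4x4 gridworld.
V_PI = np.array([
    [0.0, -14.0, -20.0, -22.0],
    [-14.0, -18.0, -20.0, -20.0],
    [-20.0, -20.0, -18.0, -14.0],
    [-22.0, -20.0, -14.0, 0.0],
])
MOVES = {"left": (0, -1), "right": (0, 1), "down": (1, 0), "up": (-1, 0)}

def greedy_actions(state, gamma=1.0):
    """Return every action that maximizes q_pi(s, a) = -1 + gamma * v_pi(s')."""
    i, j = state
    q = {}
    for name, (di, dj) in MOVES.items():
        ni, nj = i + di, j + dj
        if not (0 <= ni < 4 and 0 <= nj < 4):
            ni, nj = i, j  # off-grid moves leave the state unchanged
        q[name] = -1 + gamma * V_PI[ni, nj]
    best = max(q.values())
    return [name for name, value in q.items() if value == best]

print(greedy_actions((0, 1)))  # cell next to the top-left terminal state
print(greedy_actions((0, 3)))  # top-right corner, where two actions tie
```

Returning all maximizing actions keeps ties visible; any of the tied actions satisfies the improvement condition.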

<hr>


# Policy Iteration

### Example 4.2 Jack’s Car Rental

#### Exercise 4.4 The policy iteration algorithm on page 80 has a subtle bug in that it may never terminate if the policy continually switches between two or more policies that are equally good. This is ok for pedagogy, but not for actual use. Modify the pseudocode so that convergence is guaranteed.

#### Exercise 4.5 How would policy iteration be defined for action values? Give a complete algorithm for computing $q_*$, analogous to that on page 80 for computing $v_*$. Please pay special attention to this exercise, because the ideas involved will be used throughout the rest of the book.

#### Exercise 4.6 Suppose you are restricted to considering only policies that are $\epsilon$-soft, meaning that the probability of selecting each action in each state, $s$, is at least $\epsilon/|A(s)|$. Describe qualitatively the changes that would be required in each of the steps 3, 2, and 1, in that order, of the policy iteration algorithm for $v_*$ on page 80.

#### Exercise 4.7 (programming) Write a program for policy iteration and re-solve Jack’s car rental problem with the following changes. One of Jack’s employees at the first location rides a bus home each night and lives near the second location. She is happy to shuttle one car to the second location for free. Each additional car still costs \$2, as do all cars moved in the other direction. In addition, Jack has limited parking space at each location. If more than 10 cars are kept overnight at a location (after any moving of cars), then an additional cost of \$4 must be incurred to use a second parking lot (independent of how many cars are kept there). These sorts of nonlinearities and arbitrary dynamics often occur in real problems and cannot easily be handled by optimization methods other than dynamic programming. To check your program, first replicate the results given for the original problem.

# Value Iteration

### Example 4.3 Gambler’s Problem

#### Exercise 4.8 Why does the optimal policy for the gambler’s problem have such a curious form? In particular, for capital of 50 it bets it all on one flip, but for capital of 51 it does not. Why is this a good policy?

#### Exercise 4.9 (programming) Implement value iteration for the gambler’s problem and solve it for $p_h = 0.25$ and $p_h = 0.55$. In programming, you may find it convenient to introduce two dummy states corresponding to termination with capital of 0 and 100, giving them values of 0 and 1 respectively. Show your results graphically, as in Figure 4.3. Are your results stable as $\theta \rightarrow 0$?

#### Exercise 4.10 What is the analog of the value iteration update (4.10) for action values, $q_{k+1}(s, a)$?

# Asynchronous Dynamic Programming

# Generalized Policy Iteration

# Efficiency of Dynamic Programming