# The Agent-Environment Interface

The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. The agent and environment interact at each of a sequence of discrete time steps. At each time step $t$, the agent receives some representation of the environment’s state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}(s)$. One time step later, in part as a consequence of its action, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$.
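
The interaction described above can be sketched as a simple loop. This is only an illustration: `TwoStateEnv`, its `step` method, and the `run` helper are hypothetical names invented here, and the reward rule (reward 1 when the action matches the current state) is an arbitrary assumption.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment with states {0, 1} and actions {0, 1}."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # The action taken at time t determines R_{t+1} and S_{t+1}.
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.choice([0, 1])
        return reward, self.state


def run(env, policy, num_steps):
    """Record the trajectory S_0, A_0, R_1, S_1, A_1, R_2, ..."""
    trajectory = []
    state = env.state
    for _ in range(num_steps):
        action = policy(state)                 # A_t is chosen on the basis of S_t
        reward, next_state = env.step(action)  # R_{t+1} and S_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


trajectory = run(TwoStateEnv(), policy=lambda s: s, num_steps=5)
for s, a, r, s_next in trajectory:
    print(f"S_t={s}  A_t={a}  R_t+1={r}  S_t+1={s_next}")
```

Note that the reward is indexed $t+1$, not $t$: it arrives together with the next state, one step after the action that caused it.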

In a finite MDP, the sets of states, actions, and rewards are all finite. The dynamics of the MDP are then described by a discrete probability distribution in which each new state $s' \in \mathcal{S}$ and reward $r \in \mathcal{R}$ depends only on the preceding state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$. This distribution is the function $p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0,1]$:

$$p(s', r{\space}|s, a) \doteq Pr\{ S_t = s', R_t = r{\space} | S_{t-1} = s, A_{t-1} = a\},
$$

$$\sum_{s' \in \mathcal{S}}^{} \sum_{r \in \mathcal{R}}^{} p(s', r{\space}|s, a) = 1 {\space}{\space} for {\space}{\space} all {\space}{\space} s \in \mathcal{S}, {\space}{\space} a \in \mathcal{A}
$$

From this four-argument dynamics function, $p$, we can compute any other quantity we might want to know about the environment, such as the state-transition probabilities:

\begin{equation*}
p(s'{\space}|s, a) = \sum_{r \in \mathcal{R}}^{} p(s', r{\space}|s, a)
\end{equation*}

Expected reward from state-action pairs :

\begin{equation*}
r(s,a) = \sum_{r \in \mathcal{R}}^{} r \sum_{s' \in \mathcal{S}}^{} p(s', r{\space}|s, a)
\end{equation*}

And the expected reward from state-action-nextstate triples :

\begin{equation*}
r(s, a, s') = \sum_{r \in \mathcal{R}}^{} r {\space} \frac{p(s', r{\space}|s, a)}{p(s'{\space}|s, a)}
\end{equation*}
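
As a rough sketch of how these three quantities follow from $p$, the snippet below stores a made-up dynamics function as a NumPy array and derives each one by summing over the appropriate axes. The array layout `p[s, a, s', k]`, the reward set `reward_values`, and the random toy MDP are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions, reward set R = {0, 1}.
reward_values = np.array([0.0, 1.0])
rng = np.random.default_rng(0)

# p[s, a, s_next, k] = p(s', r_k | s, a), normalised over (s', r) for each (s, a).
p = rng.random((2, 2, 2, 2))
p /= p.sum(axis=(2, 3), keepdims=True)

# State-transition probabilities p(s' | s, a): sum out the reward.
p_next = p.sum(axis=3)

# Expected reward r(s, a): sum over s' and r of r * p(s', r | s, a).
r_sa = (p * reward_values).sum(axis=(2, 3))

# Expected reward r(s, a, s'): additionally condition on the next state.
r_sas = (p * reward_values).sum(axis=3) / p_next

print(p_next.sum(axis=2))  # each (s, a) row of p(s' | s, a) sums to 1
```

The three results are consistent with one another: averaging $r(s, a, s')$ over $p(s' \mid s, a)$ recovers $r(s, a)$.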

As noted above, each finite MDP has three characteristics:

- States : represent the basis on which the choices are made.
- Actions : represent the choices made by the agent.
- Rewards : represent what the agent receives for choosing an action.

#### Exercise 3.1 Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples.

Answer : One example is the goalkeeper in a soccer game. The agent has two states: guarding the goal against the opponent, and holding the ball. It has three actions: moving, catching the ball, and throwing the ball. We can give the agent a reward of +1 each time it catches the ball and -10 each time it fails to catch it.

#### Exercise 3.2 Is the MDP framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?

Answer : Some goal-directed learning tasks produce a vector of rewards instead of a scalar. If this vector cannot be reduced to a single scalar signal, the MDP framework as defined here cannot be applied.

#### Exercise 3.3 Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out—say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in—say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices of where to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?

Answer : The right place to draw the line between agent and environment depends on the goal of the task. For example, if we want to control how the car interacts with the outside world, we would define the agent to be the car. For any given task, some information is easier to obtain than other information, and this guides the modeler in selecting the level to work at, so it is not an entirely free choice.

### Example 3.3 Recycling Robot

In [2]:
import numpy as np

class RR_Env():

    def __init__(self, alpha, beta, reward):
        self.alpha = alpha
        self.beta = beta
        self.reward = reward

    def getTransitions(self):

        sN = 2 # number of states
        aN = 3 # number of actions
        sM = {'low' : 0, 'high' : 1}
        aM = {'search' : 0, 'wait' : 1, 'recharge' : 2}

        trans = np.zeros((sN, aN, sN))
        trans[sM['high'], aM['search'], sM['high']] = self.alpha
        trans[sM['high'], aM['search'], sM['low']] = 1 - self.alpha
        trans[sM['low'], aM['search'], sM['high']] = 1 - self.beta
        trans[sM['low'], aM['search'], sM['low']] = self.beta
        trans[sM['high'], aM['wait'], sM['high']] = 1
        trans[sM['high'], aM['wait'], sM['low']] = 0
        trans[sM['low'], aM['wait'], sM['high']] = 0
        trans[sM['low'], aM['wait'], sM['low']] = 1
        trans[sM['low'], aM['recharge'], sM['high']] = 1
        trans[sM['low'], aM['recharge'], sM['low']] = 0

        return trans
    
    def getRewards(self):

        sN = 2 # number of states
        aN = 3 # number of actions
        sM = {'low' : 0, 'high' : 1}
        aM = {'search' : 0, 'wait' : 1, 'recharge' : 2}

        # rewards[s, a, s']; np.nan marks transitions that cannot occur
        rewards = np.zeros((sN, aN, sN))
        rewards[sM['high'], aM['search'], sM['high']] = self.reward[0]
        rewards[sM['high'], aM['search'], sM['low']] = self.reward[0]
        rewards[sM['low'], aM['search'], sM['high']] = -3  # battery depleted, robot rescued
        rewards[sM['low'], aM['search'], sM['low']] = self.reward[0]
        rewards[sM['high'], aM['wait'], sM['high']] = self.reward[1]
        rewards[sM['high'], aM['wait'], sM['low']] = np.nan
        rewards[sM['low'], aM['wait'], sM['high']] = np.nan
        rewards[sM['low'], aM['wait'], sM['low']] = self.reward[1]
        rewards[sM['low'], aM['recharge'], sM['high']] = self.reward[2]
        rewards[sM['low'], aM['recharge'], sM['low']] = np.nan

        return rewards


rrEnv = RR_Env(0.5, 0.4, [2, 1, 0]) #alpha=0.5 beta=0.4 r_search=2 r_wait=1 r_recharge=0
print(rrEnv.getTransitions())
print(rrEnv.getRewards())
        

[[[0.4 0.6]
  [1.  0. ]
  [0.  1. ]]

 [[0.5 0.5]
  [0.  1. ]
  [0.  0. ]]]
[[[ 2. -3.]
  [ 1. nan]
  [nan  0.]]

 [[ 2.  2.]
  [nan  1.]
  [ 0.  0.]]]


#### Exercise 3.4 Give a table analogous to that in Example 3.3, but for $p(s', r{\space}|s, a)$. It should have columns for $s, a, s', r,$ and $p(s',r{\space}|s,a)$, and a row for every 4-tuple for which $p(s',r{\space}|s,a) > 0$.

Answer : 

| s  | a  | s'  |  r  |  $p(s',r{\mid}s,a)$ |
|:-:|:-:|:-:|:-:|:-:|
| high  | search  | high  | $r_{search}$  | $\alpha$  |
| high  | search  | low  |  $r_{search}$ | $1-\alpha$  |
| low  |  search |  high |  -3 | $1-\beta$  |
| low  |  search |  low | $r_{search}$ | $\beta$  |
| high  | wait  | high  | $r_{wait}$  | 1  |
| low  | wait | low  | $r_{wait}$  | 1  |
| low  | recharge | high  | 0  | 1  |
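
A quick sanity check on a table like this is to confirm that, for every state-action pair, the listed probabilities sum to 1. The sketch below does that with a plain dictionary; the numeric values ($\alpha = 0.5$, $\beta = 0.4$, $r_{search} = 2$, $r_{wait} = 1$) are the ones passed to `RR_Env` earlier, and the `dynamics` and `totals` names are assumptions for illustration.

```python
alpha, beta = 0.5, 0.4
r_search, r_wait = 2, 1

# (s, a, s', r) -> p(s', r | s, a), one entry per table row
dynamics = {
    ("high", "search", "high", r_search): alpha,
    ("high", "search", "low", r_search): 1 - alpha,
    ("low", "search", "high", -3): 1 - beta,
    ("low", "search", "low", r_search): beta,
    ("high", "wait", "high", r_wait): 1.0,
    ("low", "wait", "low", r_wait): 1.0,
    ("low", "recharge", "high", 0): 1.0,
}

# Total probability mass for each (s, a) pair
totals = {}
for (s, a, s_next, r), prob in dynamics.items():
    totals[(s, a)] = totals.get((s, a), 0.0) + prob

# Expected reward for searching with a low battery
r_low_search = sum(
    r * prob
    for (s, a, s_next, r), prob in dynamics.items()
    if (s, a) == ("low", "search")
)

for pair, total in totals.items():
    print(pair, total)
```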

# Goals and Rewards

In reinforcement learning, the purpose or goal of the agent is formalized in terms of a special signal, called the reward, passing from the environment to the agent. At each time step, the reward is a simple number, $R_t \in \mathbb{R}$. Informally, the agent’s goal is to maximize the total amount of reward it receives. This means maximizing not immediate reward, but cumulative reward in the long run.

# Returns and Episodes

In general, we seek to maximize the expected return, where the return, denoted $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:

\begin{equation*}
G_t = R_{t+1} + R_{t+2} + ... + R_T
\end{equation*}

where $T$ is the final time step. This definition applies when the agent-environment interaction breaks naturally into episodes, each of which ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Each episode is independent of the others. Tasks of this kind are called episodic tasks.

However, there are other kinds of tasks which do not break into identifiable episodes but go on continually without limit. These are called continuing tasks. The return defined above is problematic for these tasks: in continuing tasks $T = \infty$, so $G_t$ itself can be infinite. To solve this problem we introduce discounting, with a discount rate $0 \leq \gamma \leq 1$:

\begin{equation*}
G_t = R_{t+1} + \gamma R_{t+2} + {\gamma}^{2} R_{t+3} +{\space}... = \sum_{k = 0}^{\infty} {\gamma}^{k}R_{t+k+1}
\end{equation*}

In the equation above, a reward received $k$ time steps in the future is worth only ${\gamma}^{k-1}$ times what it would be worth if it were received immediately. If ${\gamma} < 1$, $G_t$ has a finite value as long as the reward sequence $\{R_k\}$ is bounded. If ${\gamma} = 0$, the agent is “myopic”, being concerned only with maximizing immediate rewards. As ${\gamma}$ approaches 1, the agent becomes more “farsighted” and weighs future rewards more heavily.
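
The effect of the discount rate can be seen with a small numerical sketch. The helper name `discounted_return` and the constant reward sequence, truncated at 1000 steps, are assumptions chosen for illustration only.

```python
def discounted_return(rewards, gamma):
    """G_t for a finite reward sequence R_{t+1}, R_{t+2}, ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


rewards = [1.0] * 1000  # constant reward of 1, truncated at 1000 steps

myopic = discounted_return(rewards, 0.0)      # only R_{t+1} contributes
farsighted = discounted_return(rewards, 0.9)  # close to 1 / (1 - 0.9)

print(myopic, farsighted)
```

With $\gamma = 0$ only the first reward survives, while with $\gamma = 0.9$ the return stays bounded, here near $1/(1-\gamma) = 10$, even though the undiscounted sum would grow without limit.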

#### Exercise 3.5 The equations in Section 3.1 are for the continuing case and need to be modified (very slightly) to apply to episodic tasks. Show that you know the modifications needed by giving the modified version of (3.3).

Answer :

\begin{equation*}
\sum_{s' \in \mathcal{S^+}}^{} \sum_{r \in \mathcal{R}}^{} p(s', r{\space}|s, a) = 1 {\space} for {\space} all {\space} s \in \mathcal{S}, {\space} a \in \mathcal{A}
\end{equation*}

#### Exercise 3.6 Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for -1 upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task?

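Answer : In the episodic formulation the return is $G_t = -{\gamma}^{T-t-1}$, where $T$ is the time step at which the pole falls in the current episode, so only the first failure after time $t$ counts. In the discounted, continuing formulation the contribution of the next failure has the same form, $-{\gamma}^{K}$ with $K$ the number of time steps before that failure, but the return also accumulates a discounted $-1$ for every later failure.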

#### Exercise 3.7 Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes—the successive runs through the maze—so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?

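Answer : With a reward of +1 for escaping, zero at all other times, and no discounting, every episode that ends in an escape has a return of exactly +1 no matter how long it takes. The agent therefore has no incentive to escape quickly, so we have not communicated what we actually want. We can solve this by adding a penalty of -1 for each time step the robot spends in the maze without finding the exit, or by discounting so that earlier escapes are worth more.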

#### Exercise 3.8 Suppose $\gamma$ = 0.5 and the following sequence of rewards is received $R_1 = -1$, $R_2 =2$, $R_3 =6$, $R_4 =3$,  and $R_5 =2$ , with $T=5$. What are $G_0,G_1,...,G_5$? Hint: Work backwards.

Answer : 

\begin{equation*}
G_5 = 0 
\end{equation*}

\begin{equation*}
G_4 = R_5 + {\gamma}G_5 = 2 + 0.5 \times 0 = 2
\end{equation*}

\begin{equation*}
G_3 = R_4 + {\gamma}G_4 = 3 + 0.5 \times 2 = 4
\end{equation*}

\begin{equation*}
G_2 = R_3 + {\gamma}G_3 = 6 + 0.5 \times 4 = 8
\end{equation*}

\begin{equation*}
G_1 = R_2 + {\gamma}G_2 = 2 + 0.5 \times 8 = 6
\end{equation*}

\begin{equation*}
G_0 = R_1 + {\gamma}G_1 = -1 + 0.5 \times 6 = 2
\end{equation*}
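
The backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$ used above is easy to check in code. This is a minimal sketch; the helper name `returns_backward` and the convention that `rewards[k]` holds $R_{k+1}$ are assumptions made here.

```python
def returns_backward(rewards, gamma):
    """rewards[k] holds R_{k+1}; returns the list [G_0, G_1, ..., G_T]."""
    returns = [0.0] * (len(rewards) + 1)  # G_T = 0 at the terminal step
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * returns[t + 1]
    return returns


print(returns_backward([-1, 2, 6, 3, 2], gamma=0.5))
# prints [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```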

#### Exercise 3.9 Suppose $\gamma = 0.9$ and the reward sequence is $R_1 = 2$ followed by an infinite sequence of 7s. What are $G_1$ and $G_0$?

Answer:

\begin{equation*}
G_1 = 7 \sum_{k = 0}^{\infty} {0.9}^{k} = 7 \times \frac{1}{0.1} = 70
\end{equation*}

\begin{equation*}
G_0 = R_1 + {\gamma}G_1 = 2 + 0.9 \times 70 = 65
\end{equation*}

#### Exercise 3.10 Prove the second equality in (3.10).

Answer : 

\begin{equation*}
G_t = \lim_{n\to\infty} \sum_{k = 0}^{n-1} {\gamma}^{k} = \lim_{n\to\infty} 1 + \gamma + {\gamma}^2 + ... + {\gamma}^{n-1}
\end{equation*}

\begin{equation*}
{\gamma}G_t = \lim_{n\to\infty} \gamma + {\gamma}^2 + ... + {\gamma}^{n} 
\end{equation*}

\begin{equation*}
G_t - {\gamma}G_t = \lim_{n\to\infty} 1 - {\gamma}^{n} 
\end{equation*}

\begin{equation*}
G_t = \lim_{n\to\infty} \frac{1 - {\gamma}^{n}}{1-{\gamma}} = \frac{1}{1-{\gamma}} {\space} if {\space} 0 \leq {\gamma} < 1
\end{equation*}

# Policies and Value Functions

A value function estimates how good it is for the agent to be in a given state (or to perform a given action in a given state), where "how good" is measured by the expected return, that is, the future rewards the agent can expect. Of course the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular ways of acting, called policies.
A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy ${\pi}$ at time $t$, then ${\pi}(a|s)$ is the probability that $A_t = a$ if $S_t = s$.

#### Exercise 3.11 If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$ , then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?

Answer : 

\begin{equation*}
{\mathbb{E}}_{\pi}[R_{t+1} | S_t = s] = \sum_{a \in \mathcal{A}}^{} {\pi}(a|s) \sum_{s' \in \mathcal{S}}^{} \sum_{r \in \mathcal{R}}^{} r {\space} p(s', r{\space}|s, a)
\end{equation*}

<hr>

The value function of a state s under a policy $\pi$, denoted $v_{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. We call the function $v_{\pi}$ the state-value function for policy $\pi$.

\begin{equation*}
v_{\pi}(s) \doteq {\mathbb{E}}_{\pi}[G_t | S_t = s] = {\mathbb{E}}_{\pi} [\sum_{k=0}^{\infty} {\gamma}^{k}R_{t+k+1} | S_t = s], {\space}{\space} for {\space}{\space} all {\space}{\space} s \in \mathcal{S}
\end{equation*}

Similarly, we define the value of taking action $a$ in state s under a policy $\pi$, denoted $q_{\pi}(s,a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$. We call $q_{\pi}$ the action-value function for policy $\pi$.

\begin{equation*}
q_{\pi}(s, a) \doteq {\mathbb{E}}_{\pi}[G_t | S_t = s, A_t = a] = {\mathbb{E}}_{\pi} [\sum_{k=0}^{\infty} {\gamma}^{k}R_{t+k+1} | S_t = s, A_t = a], {\space}{\space} for {\space}{\space} all {\space}{\space} s \in \mathcal{S}, {\space}{\space} a \in \mathcal{A}
\end{equation*}

#### Exercise 3.12 Give an equation for $v_{\pi}$ in terms of $q_{\pi}$ and $\pi$.

Answer : 

\begin{equation*}
v_{\pi}(s) \doteq {\mathbb{E}}_{\pi}[G_t | S_t = s] = \sum_{a \in \mathcal{A}}^{} {\pi}(a|s)q_{\pi}(s, a)
\end{equation*}

#### Exercise 3.13 Give an equation for $q_{\pi}$ in terms of $v_{\pi}$ and the four-argument $p$.

Answer :

\begin{equation*}
q_{\pi}(s, a) \doteq {\mathbb{E}}_{\pi}[R_{t+1} + {\gamma}G_{t+1} | S_t = s, A_t = a] = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}^{} p(s',r{\space}|s, a)[r + {\gamma}v_{\pi}(s')]
\end{equation*}
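
The two relations from Exercises 3.12 and 3.13 can be written as a pair of small functions and checked numerically. The sketch below uses a made-up MDP and policy. The array layouts `p[s, a, s', k]` and `pi[s, a]`, the reward set, and the repeated-substitution loop are assumptions for illustration, not part of the exercises.

```python
import numpy as np

gamma = 0.9
reward_values = np.array([0.0, 1.0])  # hypothetical reward set
rng = np.random.default_rng(1)

# Hypothetical dynamics p[s, a, s', k] and policy pi[s, a], both normalised.
p = rng.random((3, 2, 3, 2))
p /= p.sum(axis=(2, 3), keepdims=True)
pi = rng.random((3, 2))
pi /= pi.sum(axis=1, keepdims=True)


def q_from_v(v):
    # q(s, a) = sum over s', r of p(s', r | s, a) * [r + gamma * v(s')]
    target = reward_values[None, :] + gamma * v[:, None]  # indexed by (s', r)
    return (p * target).sum(axis=(2, 3))


def v_from_q(q):
    # v(s) = sum over a of pi(a | s) * q(s, a)
    return (pi * q).sum(axis=1)


# Substitute one relation into the other until v stops changing.
v = np.zeros(3)
for _ in range(1000):
    v = v_from_q(q_from_v(v))
q = q_from_v(v)

print(v)
print(q)
```

At the fixed point both relations hold simultaneously, so `v` and `q` are mutually consistent value functions for the policy `pi`.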

### Example 3.5 Grid World

In [18]:
class GW_Env():

    def __init__(self, reward=[10,5,0,-1], telPos1=1, telPos2=3, gridSize=5):
        self.reward = reward
        self.gridSize = gridSize
        self.telPos1 = telPos1
        self.telPos2 = telPos2

    def getTransitions(self):

        aM = {'north' : 0, 'south' : 1, 'east' : 2, 'west' : 3}
        toTelPos1 = ((self.gridSize-1) * self.gridSize) + self.telPos1
        toTelPos2 = (int(self.gridSize/2) * self.gridSize) + self.telPos2

        trans = np.zeros((self.gridSize**2, 4, self.gridSize**2))
        for i in range(self.gridSize):
            for j in range(self.gridSize):
                pos = (i * self.gridSize) + j

                if pos == self.telPos1:
                    trans[pos, aM['west'], toTelPos1] = 1
                    trans[pos, aM['east'], toTelPos1] = 1
                    trans[pos, aM['north'], toTelPos1] = 1
                    trans[pos, aM['south'], toTelPos1] = 1

                elif pos == self.telPos2:
                    trans[pos, aM['west'], toTelPos2] = 1
                    trans[pos, aM['east'], toTelPos2] = 1
                    trans[pos, aM['north'], toTelPos2] = 1
                    trans[pos, aM['south'], toTelPos2] = 1

                else:
                    if j == 0:
                        trans[pos, aM['west'], pos] = 1
                        trans[pos, aM['east'], pos+1] = 1
                    elif j == self.gridSize-1:
                        trans[pos, aM['west'], pos-1] = 1
                        trans[pos, aM['east'], pos] = 1
                    else: 
                        trans[pos, aM['east'], pos+1] = 1
                        trans[pos, aM['west'], pos-1] = 1

                    if i == 0:
                        trans[pos, aM['north'], pos] = 1
                        trans[pos, aM['south'], pos+self.gridSize] = 1
                    elif i == self.gridSize-1:
                        trans[pos, aM['north'], pos-self.gridSize] = 1
                        trans[pos, aM['south'], pos] = 1
                    else: 
                        trans[pos, aM['north'], pos-self.gridSize] = 1
                        trans[pos, aM['south'], pos+self.gridSize] = 1
                
        return trans
    
    def getRewards(self):

        aM = {'north' : 0, 'south' : 1, 'east' : 2, 'west' : 3}
        toTelPos1 = ((self.gridSize-1) * self.gridSize) + self.telPos1
        toTelPos2 = (int(self.gridSize/2) * self.gridSize) + self.telPos2

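        # np.empty would leave unassigned entries uninitialised; start every
        # entry at the ordinary-move reward instead
        rewards = np.full((self.gridSize**2, 4, self.gridSize**2), self.reward[2], dtype=float)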
        for i in range(self.gridSize):
            for j in range(self.gridSize):
                pos = (i * self.gridSize) + j

                if pos == self.telPos1:
                    rewards[pos, aM['west'], toTelPos1] = self.reward[0]
                    rewards[pos, aM['east'], toTelPos1] = self.reward[0]
                    rewards[pos, aM['north'], toTelPos1] = self.reward[0]
                    rewards[pos, aM['south'], toTelPos1] = self.reward[0]

                elif pos == self.telPos2:
                    rewards[pos, aM['west'], toTelPos2] = self.reward[1]
                    rewards[pos, aM['east'], toTelPos2] = self.reward[1]
                    rewards[pos, aM['north'], toTelPos2] = self.reward[1]
                    rewards[pos, aM['south'], toTelPos2] = self.reward[1]
                
                else:
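                    # bumping into an edge leaves the state unchanged and
                    # costs the wall reward
                    if j == 0:
                        rewards[pos, aM['west'], pos] = self.reward[3]
                    elif j == self.gridSize-1:
                        rewards[pos, aM['east'], pos] = self.reward[3]

                    if i == 0:
                        rewards[pos, aM['north'], pos] = self.reward[3]
                    elif i == self.gridSize-1:
                        rewards[pos, aM['south'], pos] = self.reward[3]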
                
        return rewards


gwEnv = GW_Env() # reward_teleport1 = 10, reward_teleport2 = 5
                # reward_moving = 0, reward_wall = -1
print("reward from teleport A = "
      + str(gwEnv.getRewards()[1,0,21]))
print("reward from hitting a wall = " 
      + str(gwEnv.getRewards()[0,0,0]))
print("transition probability legal move = "
      + str(gwEnv.getTransitions()[6,1,11]))
print("transition probability illegal move = "
      + str(gwEnv.getTransitions()[7,1,11]))

reward from teleport A = 10.0
reward from hitting a wall = -1.0
transition probability legal move = 1.0
transition probability illegal move = 0.0


<hr>

#### Exercise 3.14 The Bellman equation (3.14) must hold for each state for the value function $v_{\pi}$ shown in Figure 3.2 (right) of Example 3.5. Show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, -0.4, and +0.7. (These numbers are accurate only to one decimal place.)
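
One way to check this numerically is to solve the Bellman equations for the whole grid and then test the equation at the center state. The sketch below rebuilds the gridworld in compact form. It assumes the equiprobable random policy and $\gamma = 0.9$, as in the book's Figure 3.2, and the function name `gridworld_random_policy_values` is invented here.

```python
import numpy as np


def gridworld_random_policy_values(gamma=0.9, n=5):
    """Solve v = r_pi + gamma * P_pi v for the equiprobable random policy."""
    A, A_prime, r_A = (0, 1), (4, 1), 10.0   # special state A jumps to A'
    B, B_prime, r_B = (0, 3), (2, 3), 5.0    # special state B jumps to B'
    moves = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west

    P = np.zeros((n * n, n * n))  # state-to-state probabilities under the policy
    r = np.zeros(n * n)           # expected one-step reward under the policy
    for i in range(n):
        for j in range(n):
            s = i * n + j
            for di, dj in moves:
                if (i, j) == A:
                    (ni, nj), reward = A_prime, r_A
                elif (i, j) == B:
                    (ni, nj), reward = B_prime, r_B
                elif 0 <= i + di < n and 0 <= j + dj < n:
                    (ni, nj), reward = (i + di, j + dj), 0.0
                else:
                    (ni, nj), reward = (i, j), -1.0  # bumped into the edge
                P[s, ni * n + nj] += 0.25
                r[s] += 0.25 * reward
    v = np.linalg.solve(np.eye(n * n) - gamma * P, r)
    return v.reshape(n, n)


values = gridworld_random_policy_values()
print(np.round(values, 1))

# Bellman equation at the center: each move has probability 1/4 and reward 0.
neighbors = values[1, 2] + values[3, 2] + values[2, 1] + values[2, 3]
center_check = 0.25 * 0.9 * neighbors
print(values[2, 2], center_check)
```

With the rounded values quoted in the exercise, the same check reads $0.25 \times 0.9 \times (2.3 + 0.4 - 0.4 + 0.7) = 0.675 \approx 0.7$.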

<hr>

#### Exercise 3.15 In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant $c$ to all the rewards adds a constant, $v_c$, to the values of all states, and thus does not affect the relative values of any states under any policies. What is $v_c$ in terms of $c$ and $\gamma$?

<hr>

#### Exercise 3.16 Now consider adding a constant $c$ to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.

<hr>

### Example 3.6 Golf

<hr>

#### Exercise 3.17 What is the Bellman equation for action values, that is, for $q_{\pi}$? It must give the action value $q_{\pi}(s,a)$ in terms of the action values, $q_{\pi}(s',a')$, of possible successors to the state–action pair $(s,a)$. Hint: the backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.

<hr>

#### Exercise 3.18 The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

<img src="Dropbox/Screenshots/1.png" width=450 height=450>

#### Give the equation corresponding to this intuition and diagram for the value at the root node, $v_{\pi}(s)$, in terms of the value at the expected leaf node, $q_{\pi}(s, a)$, given $S_t = s$. This equation should include an expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(a|s)$ such that no expected value notation appears in the equation.

<hr>

#### Exercise 3.19 The value of an action, $q_{\pi}(s,a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:

<img src="Dropbox/Screenshots/2.png" width=360 height=360>

#### Give the equation corresponding to this intuition and diagram for the action value, $q_{\pi}(s,a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_{\pi}(S_{t+1})$, given that $S_t =s$ and $A_t =a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s',r{\space}|s,a)$ defined by (3.2), such that no expected value notation appears in the equation.

<hr>

# Optimal Policies and Optimal Value Functions

### Example 3.7: Optimal Value Functions for Golf

<hr>

### Example 3.8: Solving the Gridworld

<hr>

### Example 3.9: Bellman Optimality Equations for the Recycling Robot

<hr>

#### Exercise 3.20 Draw or describe the optimal state-value function for the golf example.

<hr>

#### Exercise 3.21 Draw or describe the contours of the optimal action-value function for putting, $q_{*}(s,putter)$, for the golf example.

<hr>

#### Exercise 3.22 Consider the continuing MDP shown on to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, ${\pi}_{left}$ and ${\pi}_{right}$. What policy is optimal if $\gamma$ = 0? If $\gamma$ = 0.9? If $\gamma$ = 0.5?

<img src="Dropbox/Screenshots/3.png" width=200 height=200>

<hr>

#### Exercise 3.23 Give the Bellman equation for $q_{*}$ for the recycling robot.

<hr>

#### Exercise 3.24 Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.

<hr>

#### Exercise 3.25 Give an equation for $v_{*}$ in terms of $q_{*}$.

<hr>

#### Exercise 3.26 Give an equation for $q_{*}$ in terms of $v_{*}$ and the four-argument $p$.

<hr>

#### Exercise 3.27 Give an equation for ${\pi}_{*}$ in terms of $q_{*}$.

<hr>

#### Exercise 3.28 Give an equation for ${\pi}_{*}$ in terms of $v_{*}$ and the four-argument $p$.

<hr>

#### Exercise 3.29 Rewrite the four Bellman equations for the four value functions $(v_{\pi}$, $v_{*}$, $q_{\pi}$, and $q_{*})$ in terms of the three-argument function $p$ (3.4) and the two-argument function $r$ (3.5).