<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" align="left" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a>&nbsp;| [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en) | <a href="https://supaerodatascience.github.io/reinforcement-learning/">https://supaerodatascience.github.io/reinforcement-learning/</a>

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Class 2: Policy and value optimization, formulating the problem</div>

<div class="alert alert-success">

**Learning outcomes**  
We now define the optimization problem in RL and how we formulate it. By the end of this section, you should be able to:
- link optimal policies and value functions,
- give a definition for a policy optimization problem,
- write the Bellman equations (evaluation and optimality) and define a value optimization problem,
- sketch out methods for policy and value optimization
</div>

## The optimization problem

Recall that an optimal policy $\pi^* = (\pi^*_t)_{t\in\mathbb{N}}$ is a policy which dominates any other policy in every state:
$$\pi^* \textrm{ is optimal}\Leftrightarrow \forall s\in S, \ \forall \pi, \ V^{\pi^*}(s) \geq V^\pi(s)$$

Recall also that for the discounted criterion over infinite horizon problems, there always exists at least one optimal policy that is stationary and deterministic, i.e. of the form $\pi:s\mapsto a$.

If all states have a non-zero probability of being visited by any policy, then an optimal policy should pick optimal actions along any trajectory starting in (any) $s_0$, so it should pick optimal actions in all states. Then finding a policy which maximizes $V^\pi$ in all states is actually the same as finding a policy which maximizes the value $V^\pi(s_0)$ of a fixed initial state $s_0$, or the expected value over any distribution on initial states $\mathbb{E}_{s_0\sim \rho_0}[V^\pi(s_0)]$.

To fix ideas, let's write $J(\pi) = \mathbb{E}_{s_0\sim \rho_0}[V^\pi(s_0)]$. Then:
<div class="alert alert-success">

**The policy optimization problem:**  
An optimal policy is a solution to $\max_\pi J(\pi) = \mathbb{E}_{s_0\sim \rho_0}[V^\pi(s_0)]$.
</div>

This is the core optimization problem we wish to solve.
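
To make $J(\pi)$ concrete, here is a minimal sketch on a small, hypothetical random MDP (the arrays `P`, `R`, the two initial distributions and all function names are illustrative assumptions). It enumerates every deterministic policy, computes $V^\pi$ with a linear solve (justified by the evaluation equation later in this section), and maximizes $J$ under two different initial state distributions with full support.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Hypothetical random MDP: P[s, a, s'] = p(s'|s,a), R[s, a] = r(s,a).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def value_of(pi):
    """V^pi of a deterministic policy pi (array of actions), via a linear solve."""
    idx = np.arange(n_states)
    P_pi = P[idx, pi]   # transition matrix under pi, shape (n_states, n_states)
    r_pi = R[idx, pi]   # reward vector under pi, shape (n_states,)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def J(pi, rho0):
    """J(pi) = E_{s0 ~ rho0}[ V^pi(s0) ]."""
    return rho0 @ value_of(pi)

# All deterministic policies s -> a (n_actions ** n_states of them).
policies = [np.array(p) for p in itertools.product(range(n_actions), repeat=n_states)]

rho_uniform = np.full(n_states, 1.0 / n_states)
rho_skewed = np.array([0.7, 0.1, 0.1, 0.1])

best_uniform = max(policies, key=lambda pi: J(pi, rho_uniform))
best_skewed = max(policies, key=lambda pi: J(pi, rho_skewed))
print(best_uniform, best_skewed)
```

On such a toy problem, the maximizer of $J$ should dominate every other policy in every state, whichever full-support $\rho_0$ is used. Brute-force enumeration is only viable here because the MDP is tiny.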

It is time to define state-action value functions, also called Q-functions, as they are a central object in RL. A Q-function is a function $S\times A \rightarrow \mathbb{R}$.

The state-action value function of policy $\pi$ is denoted $Q^\pi$ and is the expected return of playing action $a$ in $s$, then playing policy $\pi$ in any subsequent state.

<div class="alert alert-success"><b>State-action value function of policy $\pi$</b><br>
$$Q^\pi(s,a) = \mathbb{E}\left( \sum\limits_{t=0}^\infty \gamma^t r\left(S_t, A_t, S_{t+1}\right) \bigg| S_0 = s, A_0=a, \pi \right)$$
</div>

To be precise and reuse the full notations from the MDP definition:
\begin{align*}
Q^\pi(s,a) &=\mathbb{E}\left[ \sum\limits_{t = 0}^\infty \gamma^t R_t \quad \Bigg| \quad \begin{array}{l}S_0 = s, A_0=a,\\ A_t \sim \pi(S_t)\textrm{ for }t>0,\\ S_{t+1}\sim p(\cdot|S_t,A_t),\\R_t = r(S_t,A_t,S_{t+1})\end{array} \right],\\
 &= \mathbb{E}_{s'} \left[ r(s,a,s') + \gamma V^\pi(s') \right], \\
 &= r(s,a) + \gamma \mathbb{E}_{s'} \left[ V^\pi(s') \right]
\end{align*}
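
The last two lines of this derivation can be checked numerically. The sketch below uses a hypothetical random MDP (`P`, `R3` and the placeholder `V` are illustrative assumptions) and relies on $r(s,a) = \mathbb{E}_{s'}[r(s,a,s')]$ to go from one line to the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Hypothetical MDP: P[s, a, s'] = p(s'|s,a) and R3[s, a, s'] = r(s,a,s').
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R3 = rng.random((n_states, n_actions, n_states))

# Placeholder standing in for V^pi (the identity below holds for any V).
V = rng.random(n_states)

# Second line: E_{s'}[ r(s,a,s') + gamma V(s') ].
Q_from_r3 = np.sum(P * (R3 + gamma * V), axis=2)

# Third line: r(s,a) + gamma E_{s'}[ V(s') ], with r(s,a) = E_{s'}[ r(s,a,s') ].
R = np.sum(P * R3, axis=2)
Q_from_r = R + gamma * P @ V

print(np.max(np.abs(Q_from_r3 - Q_from_r)))
```

Both expressions produce the same `(n_states, n_actions)` array up to rounding, which is why the notes switch freely between $r(s,a,s')$ and $r(s,a)$.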

<br>
<br>
<img src="img/Qfunctions.png" style="height: 200px;"></img>

Note that this definition of Q-functions uses the $\gamma$-discounted criterion, but that it can be straightforwardly extended to any other criterion.  
Let $C((R_t)_{t\in \mathbb{N}})$ be a criterion defined on the sequence of reward random variables.  
Then the return random variable $G^\pi(s,a)$ is the random variable $C((R_t)_{t\in \mathbb{N}})$ given that $S_0 = s$, $A_0=a$, $A_t \sim \pi(S_t)$ for $t>0$, $S_{t+1}\sim p(\cdot|S_t,A_t)$, and $R_t = r(S_t,A_t,S_{t+1})$.  
And the corresponding Q-function for this criterion is simply 
$$Q^\pi(s,a) = \mathbb{E}\left[ C((R_t)_{t\in \mathbb{N}}) \quad \Bigg| \quad \begin{array}{l}S_0 = s, A_0=a,\\ A_t \sim \pi(S_t)\textrm{ for }t>0,\\ S_{t+1}\sim p(\cdot|S_t,A_t),\\R_t = r(S_t,A_t,S_{t+1})\end{array} \right].$$

## Characterizing value functions: the Bellman equations (40 minutes)

### Intuitions

Consider the maze below, where an agent can move North, South, East or West. The resulting transition is deterministic and a reward of $+1$ is gained when exiting the maze (which terminates the game). Otherwise all rewards are zero. Bumping into a wall terminates the game with a reward of zero.

<img src="img/grid_raw.png" width="200px"></img>

Let's consider the policy $\pi$ that always moves East.

<img src="img/grid_policy.png" width="200px"></img>

<div class="alert alert-warning">
    
**Poll**  
[https://linkto.run/p/NO9LB7NP](https://linkto.run/p/NO9LB7NP)  
Without writing any equation, what is the value of the top-right cell under this policy?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

$V^\pi((3,3)) = 1$
</details>

Now let's take $\gamma=0.9$.

<div class="alert alert-warning">
    
**Poll**  
[https://linkto.run/p/03LL2V5H](https://linkto.run/p/03LL2V5H)  
Without writing any equation, what is the value of the top-middle cell under this policy? What is the value of the bottom-right cell?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

The value of $(2,3)$ is the expected discounted sum of what one gets from applying $\pi$ from $(2,3)$. Since $\pi$ is deterministic and the transitions are deterministic too, $\pi((2,3))$ always takes us to state $(3,3)$. So $V^\pi((2,3)) = 0 + \gamma \times V^\pi((3,3)) = 0.9$.
    
The value of $(3,1)$ is the expected sum of discounted rewards from $(3,1)$. Since the agent bumps into the wall when applying $\pi$, it never exits the maze and only collects zero rewards. Hence $V^\pi((3,1)) = 0$.
</details>

Let's draw the value function.

<img src="img/grid_vpi.png" width="200px"></img>

Suppose you are currently in cell $(1,2)$ and would like to choose what action to take. Suppose also that you know the value function above. You need to put a scalar value on all four actions. To evaluate each action, let's estimate what we can get by applying the action, and then use $\gamma \times V^\pi(s')$, where $s'$ is the cell reached, to estimate what we can obtain in the long run after this first action. Define $Q^\pi((x,y),a)$ as the utility we estimate for each action $a$ in $(x,y)$.

<div class="alert alert-warning">
    
**Question / Poll**  
What is $Q^\pi((1,2),a)$ for action $a$ in $\{N,S,E,W\}$? What seems to be the most interesting first action to take, if we follow $\pi$ after?  
[https://linkto.run/p/5WV5GU62](https://linkto.run/p/5WV5GU62)
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

$Q^\pi((1,2),N) = 0 + \gamma \cdot \gamma^2 = 0.729$  
$Q^\pi((1,2),S) = 0 + \gamma \cdot 0 = 0$  
$Q^\pi((1,2),E) = 0 + \gamma \cdot 0 = 0$  
$Q^\pi((1,2),W) = 0$  
The best action seems to be $N$.
</details>

An optimal policy is quite easy to guess. Let's draw the optimal value function (the value function of any optimal policy).

<img src="img/grid_vopt.png" width="200px"></img>

Define $Q^*((x,y),a)$ as the utility we estimate for each action $a$ in $(x,y)$ if it is followed by an optimal policy.
<div class="alert alert-warning">
    
**Question**   
What is $Q^*((1,2),a)$ for action $a$ in $\{N,S,E,W\}$? What seems to be the most interesting first action to take, if we act optimally after? Rank the actions by utility.  
[https://linkto.run/p/3OG6IJO3](https://linkto.run/p/3OG6IJO3)
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

$Q^*$ is what we gain immediately, plus $\gamma$ times what we expect to receive from applying an optimal policy in the state we reach by applying $a$.  
$Q^*((1,2),N) = 0 + \gamma\times\gamma^2=\gamma^3$  
$Q^*((1,2),S) = 0 + \gamma\times\gamma^4=\gamma^5$  
$Q^*((1,2),E) = 0 + \gamma\times\gamma^4=\gamma^5$  
$Q^*((1,2),W) = 0 + \gamma\times0=0$  
The best action seems to be $N$, followed by $S$ and $E$ which are tied, and finally $W$.
</details>

<div class="alert alert-warning"><b>Question (no poll)</b><br>
What property does the policy that always greedily picks the $Q^*$-maximizing action in each state have?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

It is an optimal policy.
</details>

Now suppose $(1,2)$ is a special slippery cell. Going North has a $0.7$ probability of actually reaching $(1,3)$, but also a $0.2$ probability of staying in $(1,2)$ and a $0.1$ probability of ending in $(2,2)$. Note that this changes the problem and the optimal expected return function $V^*$.

<div class="alert alert-warning"><b>Question (no poll)</b><br>
Given this new problem, can you write $Q^*((1,2),N)$ as a function of $V^*(1,3)$, $V^*(1,2)$ and $V^*(2,2)$?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

When we take action $N$ in $(1,2)$, there are 3 possible outcomes:
- with probability $0.7$, reach $(1,3)$ and get reward $0$,
- with probability $0.2$, reach $(1,2)$ and get reward $0$,
- with probability $0.1$, reach $(2,2)$ and get reward $0$.

So what we can expect to get from applying $N$ in $(1,2)$ is:  
\begin{align*}
    Q^*((1,2), N) &= 0.7 \times (0+\gamma V^*(1,3)) + 0.2\times(0+\gamma V^*(1,2)) + 0.1\times(0+\gamma V^*(2,2))\\
    &= \gamma \left(0.7\times V^*(1,3) + 0.2\times V^*(1,2)+ 0.1\times V^*(2,2)\right)
\end{align*}
</details>

Now remark that if we knew the action $\pi^*((1,2))$ taken by an optimal policy in $(1,2)$, then $Q^*((1,2), \pi^*((1,2)))$ would be precisely the optimal long-term return $V^*((1,2))$ (since it would be the expected return of a policy that acts optimally at every time step, including the first one).

<div class="alert alert-warning"><b>Question</b><br>
Suppose an oracle tells us that $\pi^*((1,2))=N$. Using the previous exercise, write $V^*(1,2)$ as a function of $V^*(1,3)$, $V^*(1,2)$ and $V^*(2,2)$.
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

We have $V^*((1,2)) = Q^*((1,2),N)$, so
$$V^*((1,2)) = \gamma \left(0.7\times V^*(1,3) + 0.2\times V^*(1,2)+ 0.1\times V^*(2,2)\right)$$
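
Since $V^*((1,2))$ appears on both sides, one can go one step further and isolate it. This is only a rearrangement of the equation above, valid because $1 - 0.2\gamma \neq 0$ for any $\gamma \leq 1$:
\begin{align*}
    (1 - 0.2\gamma)\, V^*((1,2)) &= \gamma \left(0.7\times V^*(1,3) + 0.1\times V^*(2,2)\right)\\
    V^*((1,2)) &= \frac{\gamma \left(0.7\times V^*(1,3) + 0.1\times V^*(2,2)\right)}{1 - 0.2\gamma}
\end{align*}
With $\gamma = 0.9$, the denominator is $1 - 0.18 = 0.82$.
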
</details>

We have introduced the key concepts upon which this section is built: $V$ and $Q$ functions, and the relation between $V(s)$ and $V(s')$ when $s'$ can be reached from $s$ in one action. The next steps are now to write all this formally, prove strong properties and derive algorithms for computing value functions and policies.

### The evaluation equation

Drawing inspiration from the exercises above, we can define state-action value functions, also called Q-functions, which are a central object in RL. A Q-function is a function $S\times A \rightarrow \mathbb{R}$.

The state-action value function of policy $\pi$ is denoted $Q^\pi$ and is the expected return of playing action $a$ in $s$, then playing policy $\pi$ in any subsequent state.

<div class="alert alert-success"><b>State-action value function</b><br>
$$Q^\pi(s,a) = \mathbb{E}\left( \sum\limits_{t=0}^\infty \gamma^t r\left(S_t, A_t, S_{t+1}\right) \bigg| S_0 = s, A_0=a, \pi \right)$$
</div>

To be precise and reuse the full notations from the MDP definition:
\begin{align*}
Q^\pi(s,a) &=\mathbb{E}\left[ \sum\limits_{t = 0}^\infty \gamma^t R_t \quad \Bigg| \quad \begin{array}{l}S_0 = s, A_0=a,\\ A_t=\pi(S_t)\textrm{ for }t>0,\\ S_{t+1}\sim p(\cdot|S_t,A_t),\\R_t = r(S_t,A_t,S_{t+1})\end{array} \right],\\
 &= \mathbb{E}_{s'} \left[ r(s,a,s') + \gamma V^\pi(s') \right], \\
 &= r(s,a) + \gamma \mathbb{E}_{s'} \left[ V^\pi(s') \right]
\end{align*}

<br>
<br>
<img src="img/Qfunctions.png" style="height: 200px;"></img>

Note that this definition of Q-functions uses the $\gamma$-discounted criterion, but that it can be straightforwardly extended to any other criterion.  
Let $C((R_t)_{t\in \mathbb{N}})$ be a criterion defined on the sequence of reward random variables.  
Then the return random variable $G^\pi(s,a)$ is the random variable $C((R_t)_{t\in \mathbb{N}})$ given that $S_0 = s$, $A_0=a$, $A_t \sim \pi(S_t)$ for $t>0$, $S_{t+1}\sim p(\cdot|S_t,A_t)$, and $R_t = r(S_t,A_t,S_{t+1})$.  
And the corresponding Q-function for this criterion is simply 
$$Q^\pi(s,a) = \mathbb{E}\left[ C((R_t)_{t\in \mathbb{N}}) \quad \Bigg| \quad \begin{array}{l}S_0 = s, A_0=a,\\ A_t \sim \pi(S_t)\textrm{ for }t>0,\\ S_{t+1}\sim p(\cdot|S_t,A_t),\\R_t = r(S_t,A_t,S_{t+1})\end{array} \right].$$

Let's remark that $V^\pi(s) = Q^\pi(s,\pi(s))$. Let's replace $a$ by $\pi(s)$ above and we obtain an important equation to characterize $V^\pi$.
<br>
<br>
<img src="img/V-DP.png" style="height: 200px;"></img>
$$V^\pi(s) = r(s,\pi(s)) + \gamma \mathbb{E}_{s'\sim p(s'|s,\pi(s))} \left[ V^\pi(s') \right]$$

This equation uses $V^\pi(s')$ in all $s'$ reachable from $s$ to define $V^\pi(s)$.  
Since this equation is true in all $s$, this provides as many equations as we have states.

<div class="alert alert-success"><b>Evaluation equation</b><br>
$V^\pi$ obeys the linear system of equations:
$$
V^\pi\left(s\right) = r(s,\pi(s)) + \gamma \mathbb{E}_{s'\sim p(s'|s,\pi(s))} \left[ V^\pi(s') \right]\\
$$
Similarly:
$$
Q^\pi\left(s,a\right) = r(s,a) + \gamma \mathbb{E}_{s'\sim p(s'|s,a)} \left[ Q^\pi(s',\pi(s')) \right]
$$
</div>
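
Because the evaluation equation is linear in $V^\pi$, a small tabular MDP can be evaluated with a direct linear solve of $(I - \gamma P_\pi) V = r_\pi$. The sketch below does this on a hypothetical random MDP (`P`, `R`, `pi` and the variable names are illustrative assumptions, not the course's FrozenLake code), then recovers $Q^\pi$ from $V^\pi$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9

# Hypothetical random MDP: P[s, a, s'] = p(s'|s,a), R[s, a] = r(s,a).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

# Arbitrary deterministic policy: pi[s] is the action taken in state s.
pi = rng.integers(n_actions, size=n_states)

idx = np.arange(n_states)
P_pi = P[idx, pi]   # P_pi[s, s'] = p(s'|s, pi(s))
r_pi = R[idx, pi]   # r_pi[s] = r(s, pi(s))

# Evaluation equation in matrix form: (I - gamma P_pi) V = r_pi.
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Q^pi(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) V^pi(s').
Q_pi = R + gamma * P @ V_pi

print(V_pi)
print(Q_pi[idx, pi])  # should match V_pi, since V^pi(s) = Q^pi(s, pi(s))
```

A direct solve costs roughly $O(|S|^3)$ operations, so it is mostly useful as a reference on small problems.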

This leads to the introduction of the **Bellman evaluation operator**:
<div class="alert alert-success"><b>Evaluation operator $T^\pi$</b><br>
$T^\pi$ is an operator on value functions, that transforms a function $V:S\rightarrow \mathbb{R}$ into:
\begin{align*}
T^\pi V\left(s\right) &= r(s,\pi(s)) + \gamma \mathbb{E}_{s'\sim p(s'|s,\pi(s))} \left[ V(s') \right]\\
 &= r\left(s,\pi\left(s\right)\right) + \gamma \sum\limits_{s'\in S} p\left(s'|s,\pi\left(s\right)\right) V\left(s'\right)
\end{align*}
    
Similarly we can introduce an evaluation operator (with the same name $T^\pi$) over state-action value functions. <br> 
$T^\pi$ is an operator on state-action value functions, that transforms a function $Q:S\times A\rightarrow \mathbb{R}$ into:
\begin{align*}
T^\pi Q\left(s,a\right) &= r(s,a) + \gamma \mathbb{E}_{s'\sim p(s'|s,a)} \left[ Q(s',\pi(s')) \right]\\
 &= r\left(s,a\right) + \gamma \sum\limits_{s'\in S} p\left(s'|s,a\right) Q\left(s', \pi\left(s'\right)\right)
\end{align*}
</div>

Note that, fundamentally, we have written the same thing four times in the block above.  
So finding $V^\pi$ (resp. $Q^\pi$) boils down to solving the evaluation equation $V= T^\pi V$ (resp. $Q = T^\pi Q$).
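
As a sketch of why the two operators say the same thing, the code below implements both versions of $T^\pi$ on a hypothetical random MDP (all names are illustrative assumptions) and checks that, when $V(s) = Q(s,\pi(s))$, applying $T^\pi$ to $Q$ and reading the result at $(s,\pi(s))$ coincides with applying $T^\pi$ to $V$.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 3, 0.9

# Hypothetical random MDP: P[s, a, s'] = p(s'|s,a), R[s, a] = r(s,a).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
pi = rng.integers(n_actions, size=n_states)
idx = np.arange(n_states)

def T_pi_V(V):
    """(T^pi V)(s) = r(s,pi(s)) + gamma * sum_s' p(s'|s,pi(s)) V(s')."""
    return R[idx, pi] + gamma * P[idx, pi] @ V

def T_pi_Q(Q):
    """(T^pi Q)(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) Q(s',pi(s'))."""
    return R + gamma * P @ Q[idx, pi]

# Any Q-function, and the V-function it induces under pi.
Q = rng.random((n_states, n_actions))
V = Q[idx, pi]

gap = np.max(np.abs(T_pi_Q(Q)[idx, pi] - T_pi_V(V)))
print(gap)
```

The gap should be zero up to rounding for any `Q`, not only for $Q^\pi$.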

We have gone far from our original FrozenLake problem. Let's make all this very concrete:
- A policy $\pi$ is an agent's behaviour
- In every state $s$, one can expect to gain $V^\pi(s)$ in the long run by applying $\pi$
- $V^\pi(s)$ is the sum of the reward on the first step $r(s,\pi(s))$ and the expected long-term return from the next state $\gamma \mathbb{E}_{s'} \left[V^\pi(s')\right]$ 
- The function $V^\pi$ actually obeys the linear system of equations above that simply link the value of a state with the values of its successors in an episode.

We can stop for a minute on the $T^\pi$ evaluation operator (that maps a function $S\rightarrow\mathbb{R}$ to a function $S\rightarrow\mathbb{R}$) and the search for $V^\pi$.

<div class="alert alert-success"><b>Properties of $T^\pi$</b><br>
<ol>
<li> $T^\pi$ is an affine operator, it defines a linear system of equations.<br>
<li> $T^\pi$ is a contraction mapping<br>
    Specifically, with $\gamma<1$, $T^\pi$ is a $\| \cdot \|_\infty$-contraction mapping over the $\mathcal{F}(S,\mathbb{R})$ (resp. $\mathcal{F}(S\times A,\mathbb{R})$) Banach space.<br>
$\Rightarrow$ With $\gamma<1$, $V^\pi$ (resp. $Q^\pi$) is the unique solution to the (linear) fixed point equation:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$V=T^\pi V$ (resp. $Q=T^\pi Q$).
</ol>
</div>
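
Both properties can be probed numerically. The following sketch, on a hypothetical random MDP (names and sizes are illustrative assumptions), checks that $\|T^\pi V_1 - T^\pi V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$ and that $T^\pi$ commutes with affine combinations. It is a sanity check on one example, not a proof.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma = 6, 3, 0.9

# Hypothetical random MDP and deterministic policy.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
pi = rng.integers(n_actions, size=n_states)
idx = np.arange(n_states)

def T_pi(V):
    """Bellman evaluation operator on state value functions."""
    return R[idx, pi] + gamma * P[idx, pi] @ V

V1 = rng.normal(size=n_states)
V2 = rng.normal(size=n_states)

# Contraction: distances shrink by at least a factor gamma in sup norm.
lhs = np.max(np.abs(T_pi(V1) - T_pi(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))

# Affine: T(lam V1 + (1-lam) V2) = lam T(V1) + (1-lam) T(V2).
lam = 0.3
affine_gap = np.max(np.abs(
    T_pi(lam * V1 + (1 - lam) * V2) - (lam * T_pi(V1) + (1 - lam) * T_pi(V2))
))

print(lhs, rhs, affine_gap)
```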

Let's use this second property to compute $Q^\pi$ for the policy that always moves right on FrozenLake.

Suppose we start with $Q_0(s,a) = 0$ for all $(s,a)$.

Recall that, in FrozenLake, rewards are provided under the $r(s,a,s')$ form.

Applying $T^\pi$ once results in:
$$Q_1(s,a) = \sum_{s'} p(s'|s,a) \left[ r(s,a,s') + \gamma Q_0(s',\pi(s')) \right]$$

In plain words, $Q_1$ is the one-step expected return under policy $\pi$.

Applying $T^\pi$ twice results in:
$$Q_2(s,a) = \sum_{s'} p(s'|s,a) \left[ r(s,a,s') + \gamma Q_1(s',\pi(s')) \right]$$

This is the two-step expected return.

And so on.

If we apply $T^\pi$ enough times, $Q_n$ should become closer to $Q^\pi$, whatever the chosen value for $Q_0$.

In more formal words, because $T^\pi$ is a contraction mapping, the sequence $Q_{n+1} = T^\pi Q_n$ converges to $T^\pi$'s fixed point.

Let us live-code this.

<div class="alert alert-warning"><b>Live coding</b><br>
Let's compute the sequence $Q_{n+1} = T^\pi Q_n$.
</div>

In [None]:
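import numpy as np

# `env` (a FrozenLake environment) and `fl` (the module defining the action
# constants such as RIGHT) are assumed to be already defined in the notebook.

# Deterministic policy that always moves right: pi[s] = fl.RIGHT in every state s.
pi = fl.RIGHT*np.ones((env.observation_space.n), dtype=np.uint8)
nb_iter = 20
gamma = 0.9

# Q_0 = 0 everywhere; Qpi_sequence stores Q_0, Q_1, ..., Q_nb_iter.
Q = np.zeros((env.observation_space.n, env.action_space.n))
Qpi_sequence = [Q]
for i in range(nb_iter):
    # One application of the evaluation operator: Qnew = T^pi Q.
    Qnew = np.zeros((env.observation_space.n, env.action_space.n))
    for x in range(env.observation_space.n):
        for a in range(env.action_space.n):
            # Each outcome starts with (probability, next state, reward).
            outcomes = env.unwrapped.P[x][a]
            for o in outcomes:
                p = o[0]
                y = o[1]
                r = o[2]
                # Expectation over s' of r(s,a,s') + gamma * Q(s', pi(s')).
                Qnew[x,a] += p * (r + gamma * Q[y,pi[y]])
    Q = Qnew
    Qpi_sequence.append(Q)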
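
The same iteration can be reproduced without a Gym environment. The sketch below is a vectorized variant on a hypothetical random MDP (`P`, `R`, `pi` are illustrative assumptions) that compares each $Q_n$ with the exact $Q^\pi$ obtained from a linear solve, so the error $\|Q_n - Q^\pi\|_\infty$ can be tracked directly.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, gamma = 5, 3, 0.9

# Hypothetical random MDP and deterministic policy.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
pi = rng.integers(n_actions, size=n_states)
idx = np.arange(n_states)

# Reference solution: V^pi from the linear system, then Q^pi from V^pi.
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P[idx, pi], R[idx, pi])
Q_pi = R + gamma * P @ V_pi

# Fixed-point iteration Q_{n+1} = T^pi Q_n, starting from Q_0 = 0.
nb_iter = 50
Q = np.zeros((n_states, n_actions))
initial_error = np.max(np.abs(Q - Q_pi))
errors = []
for n in range(nb_iter):
    Q = R + gamma * P @ Q[idx, pi]
    errors.append(np.max(np.abs(Q - Q_pi)))

print(errors[0], errors[-1])
```

Since $T^\pi$ is a $\gamma$-contraction, the error after $n$ iterations is at most $\gamma^n \|Q_0 - Q^\pi\|_\infty$, which the `errors` list lets one verify.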

<div class="alert alert-warning"><b>Live coding</b><br>
Let's plot the sequence of $\| Q_n - Q_{n-1} \|_\infty$ to verify the convergence of the sequence.
</div>

In [None]:
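import matplotlib.pyplot as plt
%matplotlib inline

# Sup-norm distance between consecutive iterates: ||Q_n - Q_{n-1}||_inf.
residuals = []
for i in range(1, len(Qpi_sequence)):
    residuals.append(np.max(np.abs(Qpi_sequence[i]-Qpi_sequence[i-1])))

# Linear scale first, then log scale (a geometric decay appears as a straight line).
plt.plot(residuals)
plt.figure()
plt.semilogy(residuals);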

### The optimality equation

We can unfold the same kind of reasoning on the value of an optimal policy. We write:
$$V^{\pi^*} = V^*, \quad Q^{\pi^*} = Q^*$$

<div class="alert alert-success"><b>Theorem: Optimal greedy policy</b><br>
Any policy $\pi$ defined by $\pi(s) \in \arg\max\limits_{a\in A} Q^*(s,a)$ is an optimal policy.
</div>

And $Q^*$ obeys the same type of recurrence relation:

<div class="alert alert-success"><b>Theorem: Bellman optimality equation</b><br>
The optimal value function obeys:
\begin{align*}
    V^*(s) &= \max\limits_{a\in A} \left[ r(s,a) + \gamma \mathbb{E}_{s'\sim p(s'|s,a)} V^*(s') \right]\\
        &= \max\limits_{a\in A} \left[ r(s,a) + \gamma \sum\limits_{s'\in S} p(s'|s,a) V^*(s') \right]
\end{align*}
or in terms of $Q$-functions:
\begin{align*}
    Q^*(s,a) &= r(s,a) + \gamma \mathbb{E}_{s'\sim p(s'|s,a)} \left[ \max_{a'\in A} Q^*(s',a') \right]\\
        &= r(s,a) + \gamma \sum\limits_{s'\in S}p(s'|s,a) \max\limits_{a'\in A} Q^*(s',a')
\end{align*}
</div>

As for the evaluation equation, we have actually written the same thing four times in the block above.  
We have also defined the **Bellman optimality operator $T^*$** (on $V$ and $Q$ functions) as:
<div class="alert alert-success"><b>Bellman optimality operator</b><br>
$$\left(T^*V\right)(s) = \max\limits_{a\in A} \left[ r(s,a) + \gamma \mathbb{E}_{s'\sim p(s'|s,a)} V(s') \right]$$
$$\left(T^*Q\right)(s,a) = r(s,a) + \gamma \mathbb{E}_{s'\sim p(s'|s,a)} \left[ \max_{a'\in A} Q(s',a') \right]$$
</div>

So finding $V^*$ (resp. $Q^*$) boils down to solving $V= T^* V$ (resp. $Q = T^* Q$).
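
By analogy with what was done for $T^\pi$, one way to look for this fixed point is to apply $T^*$ repeatedly. The sketch below does so on a hypothetical random MDP (all names and the iteration count are illustrative assumptions), then extracts the greedy policy with respect to the resulting $Q$ and evaluates it exactly to check that its value matches $\max_a Q(s,a)$.

```python
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions, gamma = 4, 2, 0.9

# Hypothetical random MDP: P[s, a, s'] = p(s'|s,a), R[s, a] = r(s,a).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def T_star_Q(Q):
    """(T^* Q)(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) max_a' Q(s',a')."""
    return R + gamma * P @ Q.max(axis=1)

# Repeated application of T^*, starting from Q_0 = 0.
Q = np.zeros((n_states, n_actions))
for _ in range(500):
    Q = T_star_Q(Q)

V_star = Q.max(axis=1)       # V^*(s) = max_a Q^*(s,a)
greedy = Q.argmax(axis=1)    # greedy policy with respect to Q

# Exact evaluation of the greedy policy via the linear evaluation equation.
idx = np.arange(n_states)
V_greedy = np.linalg.solve(
    np.eye(n_states) - gamma * P[idx, greedy], R[idx, greedy]
)

print(greedy)
print(np.max(np.abs(V_greedy - V_star)))
```

The printed gap should be close to zero, which is consistent with the theorem stating that a policy that is greedy with respect to $Q^*$ is optimal.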

<div class="alert alert-success"><b>Properties of $T^*$</b><br>
<ol>
<li> $T^*$ is non-linear.<br>
<li> $T^*$ is a contraction mapping<br>
With $\gamma<1$, $T^*$ is a $\| \cdot \|_\infty$-contraction mapping over the $\mathcal{F}(S,\mathbb{R})$ (resp. $\mathcal{F}(S\times A,\mathbb{R})$) Banach space.<br>
$\Rightarrow$ With $\gamma<1$, $V^*$ (resp. $Q^*$) is the unique solution to the fixed point equation:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$V=T^* V$ (resp. $Q=T^* Q$).
</ol>
</div>
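
These two properties can also be probed numerically. In the sketch below (hypothetical random MDP, illustrative names), the contraction inequality is checked on two random value functions. The non-linearity shows up because $T^*$ is a maximum of affine functions: on a convex combination it can only be smaller than or equal to the combination of the images, and is generally not equal.

```python
import numpy as np

rng = np.random.default_rng(6)
n_states, n_actions, gamma = 6, 3, 0.9

# Hypothetical random MDP: P[s, a, s'] = p(s'|s,a), R[s, a] = r(s,a).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def T_star(V):
    """(T^* V)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]."""
    return np.max(R + gamma * P @ V, axis=1)

V1 = rng.normal(size=n_states)
V2 = rng.normal(size=n_states)

# Contraction in sup norm with factor gamma.
lhs = np.max(np.abs(T_star(V1) - T_star(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))

# Convexity gap: combination of the images minus image of the combination.
lam = 0.5
mix_of_images = lam * T_star(V1) + (1 - lam) * T_star(V2)
image_of_mix = T_star(lam * V1 + (1 - lam) * V2)
convexity_gap = mix_of_images - image_of_mix

print(lhs, rhs)
print(convexity_gap)
```

A strictly positive entry in `convexity_gap` means the maximizing action differs between `V1` and `V2` in that state, which is exactly where $T^*$ departs from an affine operator.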