<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" align="left" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a>&nbsp;| [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en) | <a href="https://erachelson.github.io/RLclass_MVA/">https://erachelson.github.io/RLclass_MVA/</a>

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Chapter 5: Policy improving gradients</div>

So far, we have been mostly concerned with the problem of function approximation, that is "how do we store $Q(s,a)$ in a convenient way, that is amenable to learning, retrieval and optimization?". A specific feature of the problems we tackled was that they had few discrete actions, making the dependency on $s$ the key difficulty. In particular, when looking for a $Q$-greedy action, the $\max_{a\in A}$ problem had a straightforward solution as we could iterate through actions and retain the best one. We now turn to a more general case where the actions are too numerous to be enumerated (either because the action space is continuous or because it just has too many actions). Too many states motivated the introduction of function approximators for $V$ and $Q$; too many actions similarly lead to function approximation for $\pi$. In the present chapter, we consider the general policy space (with possible approximation) and ask the question "how do we find a monotonically improving sequence of policies?".

<div class="alert alert-success">

**Learning outcomes**   
</div>

# Policy gradient methods

<div class="alert alert-success">

**Bottomline question:**   
The previous chapters have focussed on *action-value methods*; they aimed at estimating $Q^*$ in order to deduce $\pi^*$, or they jointly optimized $Q$ and $\pi$. Could we directly optimize $\pi$?
</div>

Suppose we have a policy $\pi_\theta$ parameterized by a vector $\theta$. Our goal is to find the parameter $\theta^*$ corresponding to $\pi^*$.

<div class="alert alert-warning">
    
**Exercise:**  
Recall the FrozenLake environment.  
How many states and how many actions were there in this environment? 
What would be a policy parameterization which does not make any approximation (i.e. that can represent any policy in the policy space) for stationary, memoryless, stochastic policies?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

There are 16 states and 4 actions in the FrozenLake game.

A stationary, memoryless, stochastic policy is a mapping from $S$ to $\Delta_A$. Since $S$ and $A$ are discrete, it can be represented in tabular form as:
$$\pi = \left[ \begin{array}{cccc}
\pi(a_0|s_0) & \pi(a_1|s_0) & \pi(a_2|s_0) & 1 - \sum_{i=0}^2 \pi(a_i|s_0) \\
\ldots & & & \\
\pi(a_0|s_{|S|-1}) & \pi(a_1|s_{|S|-1}) & \pi(a_2|s_{|S|-1}) & 1 - \sum_{i=0}^2 \pi(a_i|s_{|S|-1})
\end{array} \right].$$

This parameterization enables representing any stochastic policy. It involves $|A|-1=3$ parameters per row, and $|S|=16$ rows, so in total $3\times 16 = 48$ parameters.

As previously, parameterization does not necessarily involve approximation! As the action set will become large or continuous, parameterization will enable generalization across actions at the cost of approximation, but a tabular representation is also a policy parameterization.
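
As an illustration, here is a minimal sketch of this tabular parameterization in NumPy. The function name `policy_table` and the choice of a uniform policy are assumptions made for the example; the only facts taken from the text are the 16 states, the 4 actions and the $|A|-1$ free parameters per row.

```python
import numpy as np

n_states, n_actions = 16, 4  # FrozenLake (4x4 map)

def policy_table(theta):
    """Build the full |S| x |A| table from the |S| x (|A|-1) free parameters.

    The last column is deduced from the others so that each row sums to one.
    """
    theta = np.asarray(theta, dtype=float)
    last = 1.0 - theta.sum(axis=1, keepdims=True)
    table = np.hstack([theta, last])
    if (table < 0).any():
        raise ValueError("parameters do not define probability distributions")
    return table

# Hypothetical example: the uniform policy, 0.25 for each free parameter.
theta = np.full((n_states, n_actions - 1), 0.25)
pi = policy_table(theta)
print(theta.size)  # number of free parameters
```

Note that this direct parameterization is constrained (each row must stay in the simplex), which is one reason why unconstrained alternatives such as a softmax over preferences are often preferred in practice.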
</details>

Remarks:
- $\pi_\theta$ might not be able to represent $\pi^*$. We will take a shortcut and call $\pi^*$ the best policy among the $\pi_\theta$ ones.
- For discrete state and action spaces, the tabular policy representation is a special case of policy parameterization.
- Policy parameterization is a (possibly useful) way of introducing prior knowledge on the set of the desired policies.
- The optimal deterministic policies might not belong to the policy subspace of $\pi_\theta$, thus it makes sense to consider stochastic policies for $\pi_\theta$.
- Considering stochastic policies makes even more sense since it broadens the family of environments that we can tackle, such as partially observable MDPs or multi-player games.

For stochastic policies, we shall write $\pi_\theta(a|s)$.

In the remainder of the chapter, we will assume that $\pi_\theta(a|s)$ is differentiable with respect to $\theta$.

Suppose now we define some performance metric $J(\pi_\theta) = J(\theta)$. If $J$ is differentiable and a stochastic estimate $\tilde{\nabla}_\theta J(\theta)$ of the gradient is available, then we can define the stochastic gradient ascent update procedure:
$$\theta \leftarrow \theta + \alpha \tilde{\nabla}_\theta J(\theta).$$

We will call **policy gradient methods** all methods that follow such a procedure (whether or not they also learn a value function).

<div class="alert alert-success">

**Policy gradient method**   
We call **policy gradient method** any method that performs stochastic gradient ascent on the policy's parameters.  
Given a stochastic estimate $\tilde{\nabla}_\theta J(\theta)$ of the gradient of a policy's performance criterion with respect to the policy's parameters, such a method implements the update procedure: 
$$\theta \leftarrow \theta + \alpha \tilde{\nabla}_\theta J(\theta).$$
</div>
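
The update rule above can be sketched on a toy problem. Everything in this example is an assumption made for illustration: the criterion `J` is a simple concave quadratic (not an RL objective), `theta_star` is its hypothetical maximizer, and the stochastic estimate is the exact gradient corrupted by Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])  # hypothetical maximizer of the toy criterion

def J(theta):
    """Toy concave criterion, maximal at theta_star."""
    return -np.sum((theta - theta_star) ** 2)

def noisy_grad(theta, noise=0.1):
    """Stochastic estimate of the gradient: exact gradient plus Gaussian noise."""
    return -2.0 * (theta - theta_star) + noise * rng.normal(size=theta.shape)

theta = np.zeros(2)
alpha = 0.05  # step size
for _ in range(500):
    # Stochastic gradient ascent: theta <- theta + alpha * gradient estimate
    theta = theta + alpha * noisy_grad(theta)

print(theta, J(theta))
```

With a constant step size, the iterates do not converge exactly: they keep fluctuating around the maximizer, with an amplitude that shrinks with $\alpha$ and with the noise level.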

Remarks: 
- Note that $J$ is a generic criterion. For example, $J$ could be defined as the $\gamma$-discounted value of a starting state (or a distribution of starting states), or as the undiscounted reward over a certain horizon, or as the average reward.
- Note that this family of methods can use any gradient estimate for $\tilde{\nabla}_\theta J(\theta)$: formal calculus, finite differences, automated differentiation, evolution strategies, etc.
- Why is it interesting to look at methods which explicitly store a policy function? Because evaluating the policy in a given state $s$ does not require the maximization step ($\max_a Q(s,a)$), which might be computationally costly, especially for continuous actions. Instead, it replaces it with a call to $\pi_\theta(s)$ (or a draw from $\pi_\theta(a|s)$). This argument makes actor-critic architectures and direct policy search methods of choice for continuous action domains (especially common in robotics); policy gradient methods belong to this family.
- When do policy gradient approaches outperform value-based ones? It's hard to give a precise criterion; it really depends on the problem. One thing that comes into play is how easy it is to approximate the optimal policy or the optimal value function. If one is simpler than the other (by "simpler", we mean "it is easier to find a parameterization whose spanned function space almost includes the function to approximate"), then it is a good heuristic to try to approximate it. But this criterion might itself be hard to assess.
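
Among the gradient estimates listed in the remarks above, finite differences are probably the simplest to write down. The sketch below is a generic central-difference estimator; the function name `finite_difference_gradient` and the toy criterion `J_toy` (chosen because its gradient is known in closed form) are assumptions made for the example, not part of the original material.

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-5):
    """Central finite-difference estimate of the gradient of J at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = eps  # perturb one coordinate at a time
        grad.flat[i] = (J(theta + step) - J(theta - step)) / (2.0 * eps)
    return grad

# Toy criterion (hypothetical, not an RL objective) with a known gradient.
def J_toy(theta):
    return np.sin(theta[0]) + theta[1] ** 2

theta = np.array([0.3, -1.2])
estimate = finite_difference_gradient(J_toy, theta)
exact = np.array([np.cos(theta[0]), 2.0 * theta[1]])
print(estimate, exact)
```

This estimator needs two evaluations of $J$ per parameter, which quickly becomes expensive for large $\theta$, and each evaluation of $J$ is itself noisy when it comes from roll-outs.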

<div class="alert alert-warning">

**Exercise**  
From the class on Markov Decision Processes, can you recall a scalar criterion $J(\pi)$ whose optimization is provably equivalent to finding a policy that dominates any other one in every state?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

Provided $\rho_0$ has non-zero probability mass on all states, an optimal policy is a solution to $\max_\pi J(\pi)$, with $J(\pi) = \mathbb{E}_{s_0\sim \rho_0}[V^\pi(s_0)]$.

This makes optimizing $J(\pi)$ a legitimate goal for finding optimal policies: from now on we will work with $J(\pi)$.
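
To make this concrete, here is a small numerical sketch on a randomly generated tabular MDP. All names (`P`, `R`, `rho0`, `value`, `J`) and the MDP itself are assumptions made for the example, and the reward is simplified to an expected reward $r(s,a)$. The code evaluates $V^\pi$ by solving the linear evaluation equation, computes an optimal policy by value iteration, and lets one compare $J$ across policies.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] transition model
R = rng.uniform(size=(nS, nA))                 # expected reward r(s, a)
rho0 = np.full(nS, 1.0 / nS)                   # full-support start distribution

def value(pi):
    """V^pi for a tabular stochastic policy pi[s, a], via a linear solve."""
    P_pi = np.einsum("sa,sat->st", pi, P)      # Markov chain induced by pi
    r_pi = np.sum(pi * R, axis=1)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

def J(pi):
    """Scalar criterion: expected value of the starting state."""
    return rho0 @ value(pi)

# Value iteration, then the greedy (deterministic) policy in one-hot form.
V = np.zeros(nS)
for _ in range(1000):
    V = np.max(R + gamma * P @ V, axis=1)
pi_star = np.eye(nA)[np.argmax(R + gamma * P @ V, axis=1)]

pi_random = rng.dirichlet(np.ones(nA), size=nS)
print(J(pi_star), J(pi_random))
```

On such an example, the policy returned by value iteration dominates the random policy in every state, and therefore also has the larger $J$.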
</details>

**Notations**

- We consider probability density functions $p(X)$ for all random variables $X$.
- For a policy $\pi_\theta$ and a random variable $X$ we write interchangeably $p(X|\pi_\theta) = p(X|\theta)$.
- We will write $\pi_\theta(s)$ the policy's distribution over actions in $s$, and $\pi_\theta(a|s)$ the probability that this policy picks action $a$ in $s$.
- A trajectory is denoted $\tau = (s_t,a_t)_{t\in \mathbb{N}}$.
- The state random variable at step $t$ is $S_t$ and its law's density is $p_t(s)$.
- The action random variable at step $t$ is $A_t$.

# The policy gradient theorem

In this section, we derive our first key result in this class: can we obtain a usable expression for $\nabla_\theta J(\theta)$, so that we can take gradient steps which improve the current policy (hence the name of this whole chapter: policy improving gradients)?

## A Bellman equation on value gradients

As indicated in the previous section, we want to optimize the scalar criterion 
$$J(\theta) = \mathbb{E}_{s_0\sim\rho_0} [V^{\pi_\theta}(s_0)].$$

So, quite immediately, 
$$\nabla_\theta J(\theta) = \mathbb{E}_{s_0\sim\rho_0} [\nabla_\theta V^{\pi_\theta}(s_0)].$$

Let us look a little into $\nabla_\theta V^{\pi_\theta}(s_0)$. But first, let's simplify our notations so that the reasoning appears more clearly. We will drop the $\theta$ subscripts almost everywhere ($\nabla$ stands for $\nabla_\theta$ and $\pi$ stands for $\pi_\theta$). Thus we have:
$$J(\theta) = \mathbb{E}_{s_0\sim\rho_0} [V^{\pi_\theta}(s_0)] = \rho_0 V^\pi.$$

And hence: 
$$\nabla_\theta J(\theta) = \rho_0 \nabla V^{\pi}.$$

Similarly, we will write $V^\pi(s) = \mathbb{E}_{a\sim \pi} [Q^\pi(s,a)] = \pi Q^\pi$.

Let us now expand $\nabla_\theta V^{\pi_\theta}(s_0)$. For any state $s$, we have 
$$V^\pi(s) = \mathbb{E}_{a\sim \pi} [Q^\pi(s,a)] = \int_A Q^\pi(s,a) \pi(a|s) da.$$

So 
$$\nabla_\theta V^\pi(s) = \int_A \Big[Q^\pi(s,a) \nabla_\theta \pi(a|s) + \pi(a|s) \nabla_\theta Q^\pi(s,a)\Big] da.$$

Now, using the definition of $Q^\pi$, 
$$\nabla_\theta V^\pi(s) = \int_A \Big[Q^\pi(s,a) \nabla_\theta \pi(a|s) + \pi(a|s) \nabla_\theta \int_S (r(s,a,s') + \gamma V^\pi(s')) p(s'|s,a) ds'\Big] da.$$

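Neither the reward $r(s,a,s')$ nor the transition density $p(s'|s,a)$ depends on $\theta$, so $\nabla_\theta r(s,a,s') = 0$ and the reward term vanishes. So we obtain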
$$\nabla_\theta V^\pi(s) = \int_A \Big[Q^\pi(s,a) \nabla_\theta \pi(a|s) +  \gamma \pi(a|s) \nabla_\theta \int_S V^\pi(s') p(s'|s,a) ds'\Big] da.$$

We can split this into two parts, move the gradient inside the integral over $s'$ (since $p(s'|s,a)$ does not depend on $\theta$), and switch the integration order in the second term. This yields
$$\nabla_\theta V^\pi(s) = \int_A Q^\pi(s,a) \nabla_\theta \pi(a|s) da +  \gamma  \int_S \nabla_\theta V^\pi(s') \Big[\int_A \pi(a|s) p(s'|s,a) da \Big] ds'.$$

Let us write $p^\pi(s'|s) = \int_A \pi(a|s) p(s'|s,a) da$. It is the transition kernel of the Markov chain defined by the MDP controlled by $\pi$. We have
$$\nabla_\theta V^\pi(s) = \int_A Q^\pi(s,a) \nabla_\theta \pi(a|s) da +  \gamma  \int_S \nabla_\theta V^\pi(s') p^\pi(s'|s) ds'.$$

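Switching back to expectations, and using the identity $\nabla_\theta \pi(a|s) = \pi(a|s) \nabla_\theta \log \pi(a|s)$ to turn the first integral into an expectation over actions, we have:
$$\nabla_\theta V^\pi(s) = \mathbb{E}_{a\sim \pi(\cdot|s)} [Q^\pi(s,a) \nabla_\theta \log \pi(a|s)] + \gamma \mathbb{E}_{s'\sim p^\pi(\cdot|s)} [\nabla_\theta V^\pi(s')].$$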

This is a Bellman equation on $\nabla_\theta V^\pi$.
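
This recursion can be checked numerically. The sketch below is an illustration under explicit assumptions: a small random tabular MDP (`P`, `R`), an expected reward $r(s,a)$, and a tabular softmax policy with one parameter `theta[s, a]` per state-action pair, none of which come from the original text. With finite spaces, the integrals become sums, and the first term $\sum_a Q^\pi(s,a) \nabla_\theta \pi(a|s)$ has the closed form $\pi(b|s)\,(Q^\pi(s,b) - V^\pi(s))$ for the parameter `theta[s, b]` (and zero for parameters of other states). The gradient of $V^\pi$ is then obtained by solving the linear Bellman equation, and compared with finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] transition model
R = rng.uniform(size=(nS, nA))                 # expected reward r(s, a)

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta):
    """Return pi, the induced Markov chain P_pi, V^pi and Q^pi."""
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, np.sum(pi * R, axis=1))
    Q = R + gamma * P @ V
    return pi, P_pi, V, Q

theta = rng.normal(size=(nS, nA))
pi, P_pi, V, Q = evaluate(theta)

# First term: g[s, s', b] = sum_a Q(s, a) * d pi(a|s) / d theta[s', b]
g = np.zeros((nS, nS, nA))
for s in range(nS):
    g[s, s] = pi[s] * (Q[s] - V[s])

# Solve the Bellman equation grad_V = g + gamma * P_pi @ grad_V
grad_V = np.linalg.solve(
    np.eye(nS) - gamma * P_pi, g.reshape(nS, -1)
).reshape(nS, nS, nA)

# Reference: central finite differences on V^pi
eps = 1e-6
fd = np.zeros_like(grad_V)
for sp in range(nS):
    for b in range(nA):
        d = np.zeros_like(theta)
        d[sp, b] = eps
        fd[:, sp, b] = (evaluate(theta + d)[2] - evaluate(theta - d)[2]) / (2 * eps)

print(np.abs(grad_V - fd).max())
```

The linear solve plays the same role for $\nabla_\theta V^\pi$ as policy evaluation does for $V^\pi$: the "reward" of the gradient Bellman equation is the first term `g`, propagated through the Markov chain induced by $\pi$.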

## The policy gradient theorem

## The link with $Q$-greedy actions

## Deterministic policies: a limit case

## Algorithms

# A roll-out based view on policy gradients

## A Monte-Carlo policy gradient

## REINFORCE

# Actor-critic algorithms

## Introducing a critic

## Baselines in policy gradients

## Algorithms

# Homework

- DDPG
- TD3
- SAC
- A2C
- Running rollouts in parallel
- PG on continuous action domains
- PG for the finite horizon criterion
- GAE
- From off-policy PG to TRPO
- PPO
- Gradient free policy search