<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" align="left" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a>&nbsp;| [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en) | <a href="https://erachelson.github.io/RLclass_MVA/">https://erachelson.github.io/RLclass_MVA/</a>

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Chapter 6: The policy gradient theorem</div>

So far, we have been mostly concerned with the problem of function approximation, that is "how do we store $q(s,a)$ in a convenient way, that is amenable to learning, retrieval and optimization?". A specific feature of the problems we tackled was that they had few discrete actions, making the dependency on $s$ the key difficulty. In particular, when looking for a $q$-greedy action, the $\max_{a\in A}$ problem had a straightforward solution as we could iterate through actions and retain the best one. We now turn to a more general case where the actions are too numerous to be enumerated (either because the action space is continuous or because it just has too many actions). Too many states motivated the introduction of function approximators for $v$ and $q$; too many actions similarly lead to function approximation for $\pi$. In the present chapter, we consider the general policy space (with possible approximation) and ask the question "how do we find a monotonically improving sequence of policies?".

<div class="alert alert-success">

**Learning outcomes**   
</div>

# Policy gradient methods

<div class="alert alert-success">

**Bottomline question:**   
The previous chapters have focussed on *action-value methods*; they aimed at estimating $q^*$ in order to deduce $\pi^*$, or they jointly optimized $q$ and $\pi$. Could we directly optimize $\pi$?
</div>

Suppose we have a policy $\pi_\theta$ parameterized by a vector $\theta$. Our goal is to find the parameter $\theta^*$ corresponding to $\pi^*$.

<div class="alert alert-warning">
    
**Exercise:**  
Recall the FrozenLake environment.  
How many states and how many actions were there in this environment? 
What would be a policy parameterization which does not make any approximation (i.e. that can represent any policy in the policy space) for stationary, memoryless, stochastic policies?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

There are 16 states and 4 actions in the FrozenLake game.

A stationary, memoryless, stochastic policy is a mapping from $S$ to $\Delta_A$. Since $S$ and $A$ are discrete, it can be represented in tabular form as:
$$\pi = \left[ \begin{array}{cccc}
\pi(a_0|s_0) & \pi(a_1|s_0) & \pi(a_2|s_0) & 1 - \sum_{i=0}^2 \pi(a_i|s_0) \\
\ldots & & & \\
\pi(a_0|s_{|S|-1}) & \pi(a_1|s_{|S|-1}) & \pi(a_2|s_{|S|-1}) & 1 - \sum_{i=0}^2 \pi(a_i|s_{|S|-1})
\end{array} \right].$$

This parameterization enables representing any stochastic policy. It involves $|A|-1=3$ parameters per row, and $|S|=16$ rows, so in total $3\times 16 = 48$ parameters.

As previously, parameterization does not necessarily involve approximation! As the action set will become large or continuous, parameterization will enable generalization across actions at the cost of approximation, but a tabular representation is also a policy parameterization.
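
As an illustration, here is a minimal sketch of this tabular parameterization in NumPy. The function name `policy_from_params` and the choice of the uniform policy as an example are ours; the sketch also assumes that the free entries of each row are non-negative and sum to at most one, which the code does not enforce.

```python
import numpy as np

n_states, n_actions = 16, 4  # FrozenLake (4x4 map)


def policy_from_params(theta):
    """Build the |S| x |A| policy table from |A| - 1 free entries per row."""
    # The last column is deduced so that each row sums to one.
    last_column = 1.0 - theta.sum(axis=1, keepdims=True)
    return np.hstack([theta, last_column])


# The uniform policy: every free entry equals 1 / |A|.
theta = np.full((n_states, n_actions - 1), 1.0 / n_actions)
pi = policy_from_params(theta)

print(theta.size)   # number of free parameters: 16 * 3 = 48
print(pi.shape)     # (16, 4)
```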
</details>

Remarks:
- $\pi_\theta$ might not be able to represent $\pi^*$. We will take a shortcut and call $\pi^*$ the best policy among the $\pi_\theta$ ones.
- For discrete state and action spaces, the tabular policy representation is a special case of policy parameterization.
- Policy parameterization is a (possibly useful) way of introducing prior knowledge on the set of the desired policies.
- The optimal deterministic policies might not belong to the policy subspace of $\pi_\theta$, thus it makes sense to consider stochastic policies for $\pi_\theta$.
- It makes even more sense to consider stochastic policies as it widens the family of environments that we can tackle, like partially observable MDPs or multi-player games.

For stochastic policies, we shall write $\pi_\theta(a|s)$.

In the remainder of the chapter, we will assume that $\pi_\theta(a|s)$ is differentiable with respect to $\theta$.

Suppose now we define some performance metric $J(\pi_\theta) = J(\theta)$. If $J$ is differentiable and a stochastic estimate $\tilde{\nabla}_\theta J(\theta)$ of the gradient is available, then we can define the stochastic gradient ascent update procedure:
$$\theta \leftarrow \theta + \alpha \tilde{\nabla}_\theta J(\theta).$$

We will call **policy gradient methods** all methods that follow such a procedure (whether or not they also learn a value function).

<div class="alert alert-success">

**Policy gradient method**   
We call **policy gradient method** any method that performs stochastic gradient ascent on the policy's parameters.  
Given a stochastic estimate $\tilde{\nabla}_\theta J(\theta)$ of the gradient of a policy's performance criterion with respect to the policy's parameters, such a method implements the update procedure: 
$$\theta \leftarrow \theta + \alpha \tilde{\nabla}_\theta J(\theta).$$
</div>

Remarks: 
- Note that $J$ is a generic criterion. For example, $J$ could be defined as the $\gamma$-discounted value of a starting state (or a distribution of starting states), or as the undiscounted reward over a certain horizon, or as the average reward.
- Note that this family of methods can use any gradient estimate for $\tilde{\nabla}_\theta J(\theta)$: formal calculus, finite differences, automatic differentiation, evolution strategies, etc.
- Why is it interesting to look at methods which explicitly store a policy function? Because the evaluation of the policy in a given state $s$ does not require the maximization step ($\max_a Q(s,a)$), which might be computationally costly, especially for continuous actions. Instead, it replaces it with a call to $\pi_\theta(s)$ (or a draw from $\pi_\theta(a|s)$). This argument makes actor-critic architectures and direct policy search methods of choice for continuous action domains (especially common in robotics), and policy gradient methods are among them.
- When do policy gradient approaches outperform value-based ones? It's hard to give a precise criterion; it really depends on the problem. One thing that comes into play is how easy it is to approximate the optimal policy or the optimal value function. If one is simpler than the other (by "simpler", we mean "it is easier to find a parameterization whose spanned function space almost includes the function to approximate"), then it is a good heuristic to try to approximate it. But this criterion might itself be hard to assess.
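
To make the update rule concrete, here is a minimal sketch of gradient ascent on a policy's parameters, where the gradient is estimated by central finite differences (one of the estimators listed in the remarks above). Everything in this toy setting is an assumption made for illustration: a single state with three actions, the reward vector `rewards`, a softmax policy, and the step size `alpha`.

```python
import numpy as np

rewards = np.array([1.0, 0.0, 0.5])  # hypothetical one-state problem, 3 actions


def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()


def J(theta):
    """Expected reward of the softmax policy with logits theta."""
    return softmax(theta) @ rewards


def finite_difference_gradient(f, theta, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad


theta = np.zeros(3)   # uniform initial policy
alpha = 0.5           # step size (arbitrary choice)
J_start = J(theta)
for _ in range(200):
    # theta <- theta + alpha * (estimate of grad J)
    theta = theta + alpha * finite_difference_gradient(J, theta)

print(J_start, J(theta), softmax(theta))
```

Finite differences need two evaluations of $J$ per parameter, which quickly becomes impractical. This is one motivation for the analytical expression of the gradient derived later in this chapter.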

<div class="alert alert-warning">

**Exercise**  
From the class on Markov Decision Processes, can you recall a scalar criterion $J(\pi)$ whose optimization is provably equivalent to finding a policy that dominates any other one in every state?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

Provided $\rho_0$ has non-zero probability mass on all states, an optimal policy is a solution to $\max_\pi J(\pi)$, with $J(\pi) = \mathbb{E}_{s_0\sim \rho_0}[v^\pi(s_0)]$.

This makes optimizing $J(\pi)$ a legitimate goal for finding optimal policies: from now on we will work with $J(\pi)$.
</details>

**Notations**

- We consider probability density functions $p(X)$ for all random variables $X$.
- For a policy $\pi_\theta$ and a random variable $X$ we write indifferently $p(X|\pi_\theta) = p(X|\theta)$.
- We will write $\pi_\theta(s)$ the policy's distribution over actions in $s$, and $\pi_\theta(a|s)$ the probability that this policy picks action $a$ in $s$.
- A trajectory is noted $\tau = (s_t,a_t)_{t\in \mathbb{N}}$.
- The state random variable at step $t$ is $S_t$ and its law's density is $p_t(s)$.
- The action random variable at step $t$ is $A_t$.

This very short chapter serves as a header to two families of estimators for policy gradients, which we will study separately.  
First, we will look at Monte Carlo policy gradient methods. These methods rely on simulating one or several full trajectories with $\pi$ to compute $\nabla J$. These methods include the very well-known REINFORCE method, but also most of the literature on evolutionary reinforcement learning.  
Then, we will follow the same kind of reasoning that enabled moving from $v(s) = \sum_t \gamma^t r_t$ to the Bellman equation, and will study how $\nabla J$ can be expressed as a quantity one can estimate from independent $(s,a,r,s')$ samples. All the corresponding algorithms will be motivated by the famous *policy gradient theorem*.

# Reminders

Before we start, let's warm up with three key reminders.

## Reminder 1: the policy improvement theorem

<div class="alert alert-success">

**Policy improvement theorem**  
If $\pi_{n+1} \in \mathbb{G}q^{\pi_n}$, then $q^{\pi_{n+1}} \geq q^{\pi_n}$.
</div>

As stated in the introduction, we are searching for a monotonically improving sequence of policies. This is exactly what the policy improvement theorem provided: if one knows $q^{\pi_0}$, then taking $\pi_1 \in \mathbb{G}q^{\pi_0}$ yields a policy that is no worse than $\pi_0$. Previously, this led us to the policy iteration algorithm. Now, finding a greedy policy with respect to $q^{\pi_0}$ might not be that easy. Gradient ascent seems like a reasonable perspective from that point of view. So we already get a feeling there will be a connection between finding an improving sequence of policies and finding a greedy policy with respect to some value function.

<div class="alert alert-warning"><b>Exercise (easy misconception):</b><br>
Policy iteration yields a monotonically improving sequence of value functions. Can we say the same thing about value iteration or modified policy iteration?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

Interestingly, no. To obtain the guarantee that $q^{\pi_1}$ is better than $q^{\pi_0}$, one really needs to define $\pi_1$ as greedy with respect to $q^{\pi_0}$.  
Making it greedy with respect to $(\mathbb{T}^{\pi_0})^m q$ (for some initial $q$) is not sufficient.  
Funny, isn't it? We cannot guarantee a priori that the sequence of policies provided by DP methods is of increasing quality (besides the case of policy iteration). It still converges to $\pi^*$, but monotonicity is not guaranteed (at least not through the policy improvement theorem).
</details>

## Reminder 2: the linearity of the Bellman evaluation equation

Let us write $p^\pi(s'|s)$ the probability density of the transition kernel of the Markov chain corresponding to our MDP, controlled by $\pi$. In other words $p^\pi(s'|s) = \int_A \pi(a|s) p(s'|s,a) da$.

The transition kernel $p^\pi$ is a linear operator that transforms a value function into another:
$$(p^\pi v)(s) = \int_S v(s') p^\pi(s'|s) ds'.$$

Similarly, let us write $r^\pi(s) = \int_A \pi(a|s) r(s,a) da$.

With these remarks, one can rewrite the Bellman equation:

<div class="alert alert-success">

**Bellman equation (evaluation)**  
$v^\pi$ is the unique solution to $v = \mathbb{T}^\pi v$ with
$$\mathbb{T}^\pi v = r^\pi + \gamma p^\pi v.$$
</div>

Note that, in finite dimension, writing that $p^\pi$ is a linear operator is just an application of the equivalence between the matrix of an operator and the operator itself. This is particularly useful to simplify notations because expectations can then be written as inner products:
$$\mathbb{E}_{s'} [v(s')] = \int_S v(s') p^\pi(s'|s) ds' = (p^\pi v)(s).$$
But this applies also to other linear operators defined from probability maps:
$$v^\pi(s) = \mathbb{E}_{a\sim \pi} [q^\pi(s,a)] = \int_A q^\pi(s,a) \pi(a|s) da = (\pi q^\pi)(s).$$
So we can write, in short:
$$v^\pi = \pi q^\pi.$$
All in all, this boils down to writing products between linear operators (and products between matrices when $\mathcal{S}$ and $\mathcal{A}$ are finite).  
These notations are much lighter than our previous ones, and remain true whatever the nature of $\mathcal{S}$ and $\mathcal{A}$ (finite or not). They are somehow less intuitive than writing full expectations, but help make equations much more readable and will come in handy in the next section.
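
In the finite case, these operators are plain arrays, and the linearity of the evaluation equation means $v^\pi$ can be obtained by solving a linear system. The sketch below illustrates this on a small random MDP. The MDP, the policy and the array layout (`p[s, a, s']`, `r[s, a]`, `pi[s, a]`) are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 2, 0.9

# Hypothetical random MDP: p[s, a, s'] and r[s, a].
p = rng.random((n_s, n_a, n_s))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a))

# A random stochastic policy pi[s, a].
pi = rng.random((n_s, n_a))
pi /= pi.sum(axis=1, keepdims=True)

# p^pi(s'|s) = sum_a pi(a|s) p(s'|s,a)  and  r^pi(s) = sum_a pi(a|s) r(s,a).
p_pi = np.einsum("sa,sat->st", pi, p)
r_pi = np.einsum("sa,sa->s", pi, r)

# v = r^pi + gamma p^pi v is linear in v, so v^pi = (I - gamma p^pi)^{-1} r^pi.
v_pi = np.linalg.solve(np.eye(n_s) - gamma * p_pi, r_pi)

# q^pi(s,a) = r(s,a) + gamma sum_s' p(s'|s,a) v^pi(s').
q_pi = r + gamma * p @ v_pi

print(np.allclose(v_pi, r_pi + gamma * p_pi @ v_pi))   # v^pi is a fixed point of T^pi
print(np.allclose(v_pi, (pi * q_pi).sum(axis=1)))      # v^pi = pi q^pi
```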

<div class="alert alert-warning"><b>Exercise:</b><br>
    
Note that the transition kernel $p(s'|s,a)$ is a linear operator too. Rewrite the Bellman evaluation equation on state-action value functions, using the notations above. Try playing a bit with the notation, by composing $\pi$ and $p$ operators to rewrite other versions of the Bellman equation.
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

$p$ is a linear operator such that $(p v)(s,a) = \int_S v(s') p(s'|s,a) ds'$.  
It transforms a function $\mathcal{S}\rightarrow \mathbb{R}$ into another function $\mathcal{S}\times\mathcal{A} \rightarrow \mathbb{R}$.  
In finite dimension, it can be represented by an $|\mathcal{S}||\mathcal{A}|\times|\mathcal{S}|$ matrix.

The Bellman equation on state-action value functions is:
$$(\mathbb{T}^\pi q)(s,a) = r(s,a) + \gamma \int_S \left[\int_A \pi(a'|s') q(s',a') da'\right] p(s'|s,a) ds'.$$
That is:
$$(\mathbb{T}^\pi q)(s,a) = r(s,a) + \gamma \int_S (\pi q)(s') p(s'|s,a) ds'.$$
And so:
$$(\mathbb{T}^\pi q)(s,a) = r(s,a) + \gamma (p (\pi q))(s,a).$$
Turning back to the general function form yields:
$$\mathbb{T}^\pi q = r + \gamma p \pi q$$

Note that identifying the matrix corresponding to the $\pi$ operator in finite dimension is a bit tricky and requires playing with matrix dimensions appropriately. We won't go any further into details here, the point of this exercise was just to introduce the notation.

Just playing with the notation, we have:
$$(p^\pi v)(s) = \int_S v(s') p^\pi(s'|s) ds' = \int_A \int_S v(s') p(s'|s,a) ds' \pi(a|s) da = (\pi p v)(s),$$
$$p^\pi v = \pi p v.$$
Since one has $r^\pi = \pi r$, the Bellman equation on state value functions becomes:
$$v = \pi (r + \gamma p v).$$

Note that we used $r$ as an $r(s,a)$ function above.

The take-away message of this exercise is really that the evaluation equation can be cast in the simple form of composition of linear operators, which makes notations a lot easier to manipulate.
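
For the curious reader, here is one possible way (among others) to build the matrices of the $p$ and $\pi$ operators in finite dimension. The sketch assumes that state-action pairs are flattened in row-major order (index `s * n_a + a`); with this convention, the matrix of $\pi$ is block-diagonal. The random MDP and the names `P`, `R`, `Pi` are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, gamma = 3, 2, 0.9

# Hypothetical random MDP and policy.
p = rng.random((n_s, n_a, n_s))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a))
pi = rng.random((n_s, n_a))
pi /= pi.sum(axis=1, keepdims=True)

# Flatten (s, a) pairs in row-major order: index = s * n_a + a.
P = p.reshape(n_s * n_a, n_s)   # (|S||A|) x |S| matrix of the p operator
R = r.reshape(n_s * n_a)        # r seen as a vector over (s, a) pairs

# Matrix of the pi operator: |S| x (|S||A|), block-diagonal.
Pi = np.zeros((n_s, n_s * n_a))
for s in range(n_s):
    Pi[s, s * n_a:(s + 1) * n_a] = pi[s]

# T^pi q = r + gamma p pi q, so q^pi = (I - gamma P Pi)^{-1} R.
q = np.linalg.solve(np.eye(n_s * n_a) - gamma * P @ Pi, R)
v = Pi @ q                      # v^pi = pi q^pi

print(np.allclose(v, Pi @ (R + gamma * P @ v)))   # v = pi (r + gamma p v)
```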
</details>

## Reminder 3: the state occupation measure

<div class="alert alert-success">

**The state occupation measure and the scalar value of a policy**  
$$J(\pi) = \mathbb{E}_{(s_i,a_i)_{i \in \mathbb{N}}} \left[ \sum_{t=0}^\infty \gamma^t r(s_t,a_t)  | \pi, p_0 \right] = \mathbb{E}_{\substack{s\sim\rho^\pi \\ a\sim \pi}} \left[ r(s,a) \right]$$
where $\rho^\pi(s) = \sum_{t=0}^\infty \gamma^t p_t(s|\pi)$ is the *state occupancy measure under policy $\pi$ and starting distribution $p_0$*.
</div>

Note that $\rho^\pi$ is not a proper distribution per se (it sums to $\frac{1}{1-\gamma}$), so we cannot really take an expectation given that $s$ is drawn according to $\rho^\pi$. It is sometimes also called the *improper state distribution under policy $\pi$* or the *improper state visitation frequency under policy $\pi$*. What we wrote above is a slight notation abuse for:
$$J(\pi) = \int_S \left[ \int_A r(s,a) \pi (a|s) da \right] \rho^\pi(s) ds.$$

One could easily normalize $\rho^\pi$ to make it a proper probability density function. This would only introduce a constant $(1-\gamma)$ multiplicative factor to the scalar criterion $J(\pi)$ above.

In plain words, the value of a policy $\pi$ is the average value of the rewards when states are sampled according to $\rho^\pi$ and actions are sampled according to $\pi$.

<div class="alert alert-warning"><b>Exercise:</b><br>
Recall how one proves this result.
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

One has:
$$J(\pi) = \mathbb{E}_{(s_i,a_i)_{i \in \mathbb{N}}} \left[ \sum_{t=0}^\infty \gamma^t r(s_t,a_t)  | \pi, p_0 \right].$$

In the sum above:
- $s_0$ is drawn according to $p_0$,
- $a_0$ is drawn according to $\pi(s_0)$,
- all subsequent $s_{t+1}$ are drawn according to $p(s_{t+1}|s_t,a_t)$,
- and all subsequent actions $a_{t+1}$ are drawn according to $\pi(s_{t+1})$.

To facilitate the reading, we will drop the mention of $p_0$ in what follows.
We can switch the sum and the expectation and get:  
$$J(\pi) = \sum_{t=0}^\infty \gamma^t \mathbb{E}_{(s_i,a_i)_{i \in \mathbb{N}}} \left[ r(s_t,a_t)  | \pi \right].$$
But $\mathbb{E}_{(s_i,a_i)_{i \in \mathbb{N}}} \left[ r(s_t,a_t)  | \pi \right] = \mathbb{E}_{s_t,a_t} \left[ r(s_t,a_t)  | \pi \right]$. So:
$$J(\pi) = \sum_{t=0}^\infty \gamma^t \mathbb{E}_{s_t,a_t} \left[ r(s_t,a_t)  | \pi \right].$$
Now let's introduce the density of $(s_t,a_t)$:
$$J(\pi) = \sum_{t=0}^\infty \gamma^t \int_S \int_A r(s_t,a_t) p(s_t,a_t|\pi) ds_t da_t.$$
But $p(s_t,a_t|\pi) = p(s_t|\pi) p(a_t|s_t,\pi)$. By definition, $p(s_t|\pi) = p_t(s|\pi)$ and $p(a_t=a|s_t=s,\pi) = \pi(a|s)$. So:
$$J(\pi) = \sum_{t=0}^\infty \gamma^t \int_S \int_A r(s,a) p_t(s|\pi) \pi(a|s) ds da.$$
Let us isolate the terms that concern only states:
$$J(\pi) = \int_S \left[ \int_A r(s,a) \pi(a|s) da \right] \sum_{t=0}^\infty \gamma^t p_t(s|\pi) ds.$$
By noting $\rho^\pi(s) = \sum_{t=0}^\infty \gamma^t p_t(s|\pi)$ we obtain:
$$J(\pi) = \int_S \left[ \int_A r(s,a) \pi (a|s) da \right] \rho^\pi(s) ds.$$
And so finally, with a slight notation abuse because $\rho^\pi$ is not a probability distribution:
$$J(\pi) = \mathbb{E}_{\substack{s\sim\rho^\pi \\ a\sim \pi}} \left[ r(s,a) \right].$$
</details>

<div class="alert alert-warning"><b>Exercise:</b><br>
In the second reminder, we introduced $p^\pi$ as the operator which transformed a function $\mathcal{S}\rightarrow\mathbb{R}$ into another one, such that $(p^\pi v)(s) = \int_S v(s') p^\pi(s'|s) ds'$. Use this to write $\rho^\pi$ using $p^\pi$ and $\rho_0$.
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

One has $p_t(\cdot|\pi) = \rho_0 (p^\pi)^t$, where the starting distribution $\rho_0$ (written $p_0$ above) is seen as a row vector.  
So $\rho^\pi = \rho_0 \sum_t (\gamma p^\pi)^t$.
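
In finite dimension, the geometric series has the closed form $\rho^\pi = \rho_0 (I - \gamma p^\pi)^{-1}$. The sketch below checks this numerically on a small random MDP (an arbitrary example of ours, with a uniform starting distribution), together with the two properties stated in this reminder: $\rho^\pi$ sums to $\frac{1}{1-\gamma}$ and $J(\pi) = \rho^\pi r^\pi$.

```python
import numpy as np

rng = np.random.default_rng(2)
n_s, n_a, gamma = 4, 2, 0.9

# Hypothetical random MDP, policy and starting distribution.
p = rng.random((n_s, n_a, n_s))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a))
pi = rng.random((n_s, n_a))
pi /= pi.sum(axis=1, keepdims=True)
rho_0 = np.full(n_s, 1.0 / n_s)

p_pi = np.einsum("sa,sat->st", pi, p)
r_pi = (pi * r).sum(axis=1)

# Closed form: rho^pi = rho_0 (I - gamma p^pi)^{-1}, solved as a transposed system.
rho_pi = np.linalg.solve((np.eye(n_s) - gamma * p_pi).T, rho_0)

# Truncated series: sum_t gamma^t rho_0 (p^pi)^t.
rho_series = np.zeros(n_s)
p_t = rho_0.copy()              # law of S_t, starting with t = 0
for t in range(500):
    rho_series += gamma**t * p_t
    p_t = p_t @ p_pi

# Two ways of computing J(pi).
v_pi = np.linalg.solve(np.eye(n_s) - gamma * p_pi, r_pi)
J_from_v = rho_0 @ v_pi         # E_{s0 ~ rho_0}[v^pi(s0)]
J_from_rho = rho_pi @ r_pi      # "E"_{s ~ rho^pi, a ~ pi}[r(s,a)]

print(rho_pi.sum(), 1 / (1 - gamma))
print(J_from_v, J_from_rho)
```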
</details>

# The policy gradient theorem

In this section, we derive our first key result in this class: can we obtain a usable expression for $\nabla_\theta J(\theta)$, so that we can take gradient steps which improve the current policy (hence the name of this whole chapter: policy improving gradients)?

## A Bellman equation on value gradients

As indicated in the previous section, we want to optimize the scalar criterion 
$$J(\theta) = \mathbb{E}_{s_0\sim\rho_0} [v^{\pi_\theta}(s_0)].$$

So, quite immediately, 
$$\nabla_\theta J(\theta) = \mathbb{E}_{s_0\sim\rho_0} [\nabla_\theta v^{\pi_\theta}(s_0)].$$

Let us look a little into $\nabla_\theta v^{\pi_\theta}(s_0)$. We have 
$$v^\pi(s) = \mathbb{E}_{a\sim \pi} [q^\pi(s,a)] = \int_A q^\pi(s,a) \pi(a|s) da.$$

So 
$$\nabla_\theta v^\pi(s) = \int_A \Big[q^\pi(s,a) \nabla_\theta \pi(a|s) + \pi(a|s) \nabla_\theta q^\pi(s,a)\Big] da.$$

Now, using the definition of $q^\pi$, 
$$\nabla_\theta v^\pi(s) = \int_A \Big[q^\pi(s,a) \nabla_\theta \pi(a|s) + \pi(a|s) \nabla_\theta \int_S (r(s,a,s') + \gamma v^\pi(s')) p(s'|s,a) ds'\Big] da.$$

Quite obviously, $\nabla_\theta r(s,a,s') = 0$. So we obtain
$$\nabla_\theta v^\pi(s) = \int_A \Big[q^\pi(s,a) \nabla_\theta \pi(a|s) +  \gamma \pi(a|s) \nabla_\theta \int_S v^\pi(s') p(s'|s,a) ds'\Big] da.$$

We can split this in two parts, and switch the integration order in the second term. This yields
$$\nabla_\theta v^\pi(s) = \int_A q^\pi(s,a) \nabla_\theta \pi(a|s) da +  \gamma  \int_S \nabla_\theta v^\pi(s') \Big[\int_A \pi(a|s) p(s'|s,a) da \Big] ds'.$$

Let us write $p^\pi(s'|s) = \int_A \pi(a|s) p(s'|s,a) da$. It is the transition kernel of the Markov chain defined by the MDP controlled by $\pi$. We have:
$$\nabla_\theta v^\pi(s) = \int_A q^\pi(s,a) \nabla_\theta \pi(a|s) da +  \gamma  \int_S \nabla_\theta v^\pi(s') p^\pi(s'|s) ds'.$$

Switching back to an expectation over $s'$, we have:
$$\nabla_\theta v^\pi(s) = \int_A q^\pi(s,a) \nabla_\theta \pi(a|s) da + \gamma \mathbb{E}_{s'\sim p^\pi(s'|s)} [\nabla_\theta v^\pi(s')].$$

This is a Bellman equation on $\nabla_\theta v^\pi$. The first term acts as a one-step reward, and the second is the expectation over next states of $\nabla_\theta v^\pi$.

We used the $v^\pi(s)$, $q^\pi(s,a)$, etc. notations in order to remain didactic, but the writing above is quite ugly. Let us turn to function notation and operator products, as in the warm-up exercises, and re-write the same thing:
$$v^\pi = \pi q^\pi$$
Let us drop the $\theta$ subscript in $\nabla_\theta$. Then (with a slight notation abuse on the order inside the products, to keep things readable):
\begin{align*}
    \nabla v^\pi &= \nabla (\pi q^\pi)\\
    &= q^\pi \nabla \pi + \pi \nabla q^\pi\\
    &= q^\pi \nabla \pi + \pi \nabla (r + \gamma p v^\pi)\\
    &= q^\pi \nabla \pi + \gamma \pi p\nabla v^\pi\\
    &= q^\pi \nabla \pi + \gamma p^\pi \nabla v^\pi
\end{align*}


This yields a nice Bellman equation on the value function gradient:

<div class="alert alert-success">

**Bellman equation on $\nabla v^\pi$**  
$$\nabla v^\pi = q^\pi \nabla \pi + \gamma p^\pi \nabla v^\pi$$
where $\nabla$ is the gradient operator, with respect to the policy parameters. In expanded notation:

$$\nabla_\theta v^\pi(s) = \int_A q^\pi(s,a) \nabla_\theta \pi(a|s) da + \gamma \mathbb{E}_{s'\sim p^\pi(s'|s)} [\nabla_\theta v^\pi(s')].$$
</div>

## The policy gradient theorem

As with the previous Bellman equations we studied, this one too has a fixed point, corresponding to the infinite sum of discounted rewards, that is:
$$\nabla v^\pi = \sum_{t=0}^\infty \gamma^t (p^\pi)^t (q^\pi \nabla \pi).$$

Recall the definition of the state occupation measure $\rho^\pi = \rho_0 \sum_{t=0}^\infty \gamma^t (p^\pi)^t$. Then we immediately have:
$$\nabla_\theta J(\theta) = \rho^\pi (q^\pi \nabla \pi)$$
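
For readers who want the intermediate steps, the fixed point can be unrolled explicitly. As a reading aid only, write $g = q^\pi \nabla \pi$ for the one-step term (this shorthand is ours) and assume that $\nabla v^\pi$ is bounded, so that the remainder term vanishes as $n$ grows since $\gamma < 1$:
\begin{align*}
\nabla v^\pi &= g + \gamma p^\pi \nabla v^\pi\\
    &= g + \gamma p^\pi g + (\gamma p^\pi)^2 \nabla v^\pi\\
    &= \sum_{t=0}^{n-1} \gamma^t (p^\pi)^t g + (\gamma p^\pi)^n \nabla v^\pi \xrightarrow[n\rightarrow\infty]{} \sum_{t=0}^\infty \gamma^t (p^\pi)^t g.
\end{align*}
Taking the expectation over starting states then gives:
\begin{align*}
\nabla_\theta J(\theta) &= \rho_0 \nabla v^\pi\\
    &= \rho_0 \sum_{t=0}^\infty \gamma^t (p^\pi)^t g\\
    &= \rho^\pi g = \rho^\pi (q^\pi \nabla \pi).
\end{align*}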

Let us make this explicit to understand how we can obtain an estimator of this gradient:
\begin{align*}
\nabla_\theta J(\theta) &= \int_S (q^\pi \nabla \pi)(s) \rho^\pi(s) ds\\
                        &= \int_S \left[ \int_A q^\pi(s,a) \nabla \pi(a|s) da \right] \rho^\pi(s) ds\\
\end{align*}
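
This expression can be checked numerically. The sketch below compares it with a finite-difference gradient of $J$, on a small random MDP with a tabular softmax policy. The MDP, the softmax parameterization and the function names are assumptions made for the example. For this particular parameterization, $\int_A q^\pi(s,a) \nabla \pi(a|s) da$ has a simple closed form, recalled in a comment.

```python
import numpy as np

rng = np.random.default_rng(3)
n_s, n_a, gamma = 4, 3, 0.9

# Hypothetical random MDP and uniform starting distribution.
p = rng.random((n_s, n_a, n_s))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a))
rho_0 = np.full(n_s, 1.0 / n_s)


def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def evaluate(theta):
    """Return J, v^pi, q^pi, rho^pi and pi for the tabular softmax policy."""
    pi = softmax_policy(theta)
    p_pi = np.einsum("sa,sat->st", pi, p)
    r_pi = (pi * r).sum(axis=1)
    A = np.eye(n_s) - gamma * p_pi
    v = np.linalg.solve(A, r_pi)
    q = r + gamma * p @ v
    rho = np.linalg.solve(A.T, rho_0)
    return rho_0 @ v, v, q, rho, pi


def policy_gradient(theta):
    """grad J = rho^pi (q^pi grad pi), written out for the softmax table."""
    _, v, q, rho, pi = evaluate(theta)
    # For softmax: sum_a q(s,a) d pi(a|s) / d theta[s,b] = pi(b|s) (q(s,b) - v(s)).
    return rho[:, None] * pi * (q - v[:, None])


def finite_difference_gradient(theta, eps=1e-6):
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        d = np.zeros_like(theta)
        d[idx] = eps
        grad[idx] = (evaluate(theta + d)[0] - evaluate(theta - d)[0]) / (2 * eps)
    return grad


theta = rng.normal(size=(n_s, n_a))
g_pg = policy_gradient(theta)
g_fd = finite_difference_gradient(theta)
print(np.max(np.abs(g_pg - g_fd)))   # should be close to zero
```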

Let us suppose we can draw samples from $\rho^\pi$, for example by supposing that a replay buffer will mimic $\rho^\pi$ (note that this assumption's validity is far from obvious, but we shall make it anyway). We could have a Monte Carlo estimator of the policy gradient if we had a Monte Carlo estimator of the term inside the brackets.

Let us remark that:
\begin{align*}
\int_A q^\pi(s,a) \nabla \pi(a|s) da &= \int_A q^\pi(s,a) \frac{\nabla \pi(a|s)}{\pi(a|s)} \pi(a|s) da \\
 &= \int_A q^\pi(s,a) \nabla \log \pi(a|s) \pi(a|s) da\\
 &= \mathbb{E}_{a \sim \pi(s)} \left[ q^\pi(s,a) \nabla \log \pi(a|s) \right]
\end{align*}

We have turned our integral into an expectation. This is sometimes called the *nabla-log* trick.
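
Here is a small numerical illustration of the trick, for a single state and a softmax distribution over five actions. The logits `theta` and the values `q` are drawn at random purely for the example. The sketch checks the identity exactly, then estimates the same quantity by sampling actions from $\pi$.

```python
import numpy as np

rng = np.random.default_rng(4)
n_a = 5
theta = rng.normal(size=n_a)   # softmax logits for a single state s
q = rng.normal(size=n_a)       # hypothetical values q(s, a)

pi = np.exp(theta - theta.max())
pi /= pi.sum()

# Jacobians of the softmax:
# grad_pi[a, b] = d pi(a) / d theta_b = pi(a) (1[a=b] - pi(b)),
# grad_log_pi[a, b] = d log pi(a) / d theta_b = 1[a=b] - pi(b).
grad_pi = np.diag(pi) - np.outer(pi, pi)
grad_log_pi = np.eye(n_a) - pi[None, :]

# Left-hand side: sum_a q(a) grad pi(a).
lhs = q @ grad_pi
# Right-hand side: E_{a ~ pi}[q(a) grad log pi(a)], computed exactly.
rhs = (pi * q) @ grad_log_pi

# Monte Carlo estimate of the same expectation from sampled actions.
actions = rng.choice(n_a, size=200_000, p=pi)
mc = (q[actions, None] * grad_log_pi[actions]).mean(axis=0)

print(lhs)
print(rhs)
print(mc)
```

The first two vectors coincide up to round-off, while the third one only matches up to sampling noise.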

And finally we have expressed the gradient of $J(\theta)$ as an expectation of the product between the value $q^\pi$ and the gradient of $\log\pi$:
<div class="alert alert-success">

**Policy gradient theorem:**  
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\substack{s\sim\rho^\pi \\ a\sim \pi}} \left[ q^\pi(s,a) \nabla_\theta \log\pi(a|s)\right]$$

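Or in operator notation, where the equality is exact for the unnormalized $\rho^\pi$ (the proportionality above accounts for a normalized state distribution, which only introduces a constant $\frac{1}{1-\gamma}$ factor):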
$$\nabla_\theta J(\theta) = \rho^\pi \pi (q^\pi \nabla \log \pi)$$
</div>

The interpretation of this theorem is very straightforward:  
To increase the average value of policy $\pi_\theta$ over the starting distribution $p_0$, we should change $\theta$ in a direction that is a linear combination of the $\nabla_\theta \log \pi(a|s)$, where the coefficients are the expected outcomes $q^\pi(s,a)$ of picking action $a$ in $s$.  
Since $\nabla_\theta \log \pi(a|s)$ is a direction that increases the log probability of $a$ in $s$, we can rephrase the last sentence. The policy gradient tells us:  
**To increase the value of the current policy, we should increase the log-probability of $a$ in $s$ in proportion to the expected outcome of $a$ in $s$.**
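
The theorem suggests a sampling-based estimator: draw states from the normalized occupancy measure $(1-\gamma)\rho^\pi$, draw actions from $\pi$, average $q^\pi(s,a) \nabla_\theta \log\pi(a|s)$, and rescale by $\frac{1}{1-\gamma}$. The sketch below does this on a small random MDP with a tabular softmax policy. It is only an illustration under strong assumptions: $q^\pi$ and $\rho^\pi$ are computed exactly from the model, which a learning agent cannot do, and the MDP, $\gamma = 0.5$ and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n_s, n_a, gamma = 3, 2, 0.5

# Hypothetical random MDP, uniform starting distribution, softmax policy.
p = rng.random((n_s, n_a, n_s))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a))
rho_0 = np.full(n_s, 1.0 / n_s)
theta = rng.normal(size=(n_s, n_a))
pi = np.exp(theta)
pi /= pi.sum(axis=1, keepdims=True)

# Exact v^pi, q^pi and (unnormalized) rho^pi from the model.
p_pi = np.einsum("sa,sat->st", pi, p)
A = np.eye(n_s) - gamma * p_pi
v = np.linalg.solve(A, (pi * r).sum(axis=1))
q = r + gamma * p @ v
rho = np.linalg.solve(A.T, rho_0)

# Exact policy gradient for the softmax table.
grad_exact = rho[:, None] * pi * (q - v[:, None])

# Sample s ~ (1 - gamma) rho^pi and a ~ pi(.|s).
n_samples = 200_000
rho_bar = (1 - gamma) * rho
rho_bar = rho_bar / rho_bar.sum()   # guard against round-off
states = rng.choice(n_s, size=n_samples, p=rho_bar)
u = rng.random(n_samples)
cdf = np.cumsum(pi, axis=1)
actions = np.minimum((u[:, None] > cdf[states]).sum(axis=1), n_a - 1)

# Score of each sample: d log pi(a|s) / d theta[s, b] = 1[a=b] - pi(b|s).
score = -pi[states]
score[np.arange(n_samples), actions] += 1.0

# Average q(s,a) * score into the rows of the visited states, then rescale.
grad_mc = np.zeros((n_s, n_a))
np.add.at(grad_mc, states, q[states, actions][:, None] * score)
grad_mc /= n_samples * (1 - gamma)

print(grad_exact)
print(grad_mc)
```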

## The link with $q$-greedy actions

We can now connect the policy gradient we just wrote, with the search for policies that are greedy with respect to some value function $q$.

It is quite tempting, when one has a value function $q$ and a policy $\pi$, to try to compute $\nabla_\theta \mathbb{E}_{s\sim\rho} [q(s,\pi(s))]$ given that we have states sampled from some distribution $\rho$. The question we would like to answer is:  
**Under what conditions does this *greediness gradient* match the true policy gradient that enables finding a policy that is better than $\pi$?**

Let us simply rewrite our greediness gradient:
$$\nabla_\theta \mathbb{E}_{s\sim\rho} [q(s,\pi(s))] = \nabla \rho (\pi q)$$
And since $\rho$ and $q$ do not depend on $\theta$ (with the same slight notation abuse as before on the order inside the product):
$$\nabla_\theta \mathbb{E}_{s\sim\rho} [q(s,\pi(s))] = \rho (q \nabla \pi)$$
As before, one can turn this into a gradient that can be estimated via sampling, using the nabla-log trick:
$$\nabla_\theta \mathbb{E}_{s\sim\rho} [q(s,\pi(s))] = \rho \pi (q \nabla \log \pi)$$

This expression is equal to the true policy gradient if $\rho=\rho^\pi$ and if $q = q^\pi$.

In the end, **this is a generalization of the policy improvement theorem**:  
if $q=q^\pi$ (just as in the policy improvement theorem),  
and if $s$ is sampled according to $\rho^\pi$,  
then the direction $\nabla_\theta \mathbb{E}_{s\sim\rho}[q(s,\pi(s))] = \mathbb{E}_{\substack{s\sim\rho\\ a\sim\pi}} [q(s,a) \nabla_\theta \log \pi(a|s)]$,  
is an ascent direction for $J(\pi)$.  
So the sequence of policies following infinitesimal increments along this greediness gradient improves monotonically in value.

Conversely, if $q \neq q^\pi$ or if $\rho \neq \rho^\pi$, we can't guarantee anything.
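
The favorable case can be illustrated numerically. The sketch below follows small steps along the greediness gradient with $\rho = \rho^\pi$ and $q = q^\pi$ (both recomputed exactly at each step) and records $J$ along the way. The random MDP, the tabular softmax policy, $\gamma = 0.5$ and the small step size `alpha` are all arbitrary choices made for the example; a finite step size only approximates the infinitesimal increments discussed above.

```python
import numpy as np

rng = np.random.default_rng(7)
n_s, n_a, gamma = 4, 3, 0.5

# Hypothetical random MDP and uniform starting distribution.
p = rng.random((n_s, n_a, n_s))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a))
rho_0 = np.full(n_s, 1.0 / n_s)


def evaluate_and_gradient(theta):
    """Return J(theta) and rho^pi (q^pi grad pi) for a tabular softmax policy."""
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    p_pi = np.einsum("sa,sat->st", pi, p)
    A = np.eye(n_s) - gamma * p_pi
    v = np.linalg.solve(A, (pi * r).sum(axis=1))
    q = r + gamma * p @ v
    rho = np.linalg.solve(A.T, rho_0)
    grad = rho[:, None] * pi * (q - v[:, None])
    return rho_0 @ v, grad


theta = np.zeros((n_s, n_a))   # uniform initial policy
alpha = 0.01                   # small step size
values = []
for _ in range(500):
    J, grad = evaluate_and_gradient(theta)
    values.append(J)
    theta = theta + alpha * grad
values = np.array(values)

print(values[0], values[-1])
print(np.all(np.diff(values) >= -1e-12))   # J never decreases along the sequence
```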

## Deterministic policies: a limit case

Left out for now.

## Algorithms

# A roll-out based view on policy gradients

## A Monte-Carlo policy gradient

## REINFORCE

# Actor-critic algorithms

## Introducing a critic

## Baselines in policy gradients

## Algorithms

# Homework

- DDPG
- TD3
- SAC
- A2C
- Running rollouts in parallel
- PG on continuous action domains
- PG for the finite horizon criterion
- GAE
- From off-policy PG to TRPO
- PPO
- Gradient free policy search