# Lecture 10 - Policy Gradient III

provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)

---

<div class="alert alert-block alert-info">
Table of Contents: <br>
    
<ul>
    <li>1. <a href="#1.-Introduction">Introduction</a></li>
    <li>2. <a href="#2.-Need-for-Automatic-Step-Size-Tuning">Need for Automatic Step Size Tuning</a>
    <ul>
        <li>2.1. <a href="#2.1.-Local-Approximation">Local Approximation</a></li>
        <li>2.2. <a href="#2.2.-Trust-Region">Trust Region</a></li>
        <li>2.3. <a href="#2.3.-TRPO">TRPO</a></li>
    </ul>
    </li>
    <li>3. <a href="#3.-Resource">Resource</a></li>
</ul>
</div>

# 1. Introduction

Today's lecture covers two more tools for automatic step-size tuning: trust regions and the TRPO algorithm.

# 2. Need for Automatic Step Size Tuning

Recall the objective function we defined in Lecture 9.

$$
\begin{equation}
    \begin{split}
    L_{\pi}(\tilde{\pi}) = V(\tilde{\theta}) & = V(\theta) + \mathbb{E}_{\pi_{\tilde{\theta}}}[\sum_{t = 0}^{\infty} \gamma^{t} A_{\pi}(s_{t}, a_{t})]\\
    & = V(\theta) + \sum_{s} \mu_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a~|~s) A_{\pi}(s, a)\\
    \mu_{\tilde{\pi}}(s) & = \mathbb{E}_{\tilde{\pi}}[\sum_{t = 0}^{\infty} \gamma^{t} I(s_{t} = s)]
    \end{split}
\end{equation} \hspace{1em} (Eq.~1)\\
$$

## 2.1. Local Approximation

I copied over the text from the previous lecture for completeness.

We can slightly rewrite Eq. 1 so that we have a substitute for $\mu_{\tilde{\pi}}$:

$$
L_{\pi}(\tilde{\pi}) = V(\theta) + \sum_{s} \mu_{\pi}(s) \sum_{a} \tilde{\pi}(a~|~s) A_{\pi}(s, a) \hspace{1em} (Eq.~2)\\
$$

Eq. 2, instead of using $\mu_{\tilde{\pi}}$, the discounted weighted frequency of state $s$ under the new policy $\tilde{\pi}$, uses $\mu_{\pi}$, the current policy's discounted weighted frequency of state $s$. This matters because we can sample states by running the current policy, but not by running a policy we have not found yet.

This raises the question: how do Eq. 1 and Eq. 2 fit into our current understanding of policy gradients? Over Lecture 8 and Lecture 9, we have seen a lot of formulas involving value functions.

For now, I'm still not too sure. Let's give it some time.
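To make Eq. 1 and Eq. 2 concrete, here is a minimal sketch on a small random tabular MDP. Everything in it is an assumption made for illustration and does not come from the lecture: the MDP (`P`, `R`, `rho`), the helper `policy_quantities`, and the choice of a new policy that is a 90/10 blend of the current policy with a random one. It checks that Eq. 1 reproduces the new policy's value exactly, while Eq. 2 only approximates it.

```python
import numpy as np

def policy_quantities(P, R, pi, rho, gamma):
    """Exact value, advantages and discounted state frequencies of a tabular policy.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    pi: (S, A) policy, rho: (S,) start-state distribution.
    """
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state transitions under pi
    r_pi = np.sum(pi * R, axis=1)                # expected one-step reward under pi
    resolvent = np.linalg.inv(np.eye(n_states) - gamma * P_pi)
    v = resolvent @ r_pi                         # V^pi(s)
    q = R + gamma * (P @ v)                      # Q^pi(s, a)
    adv = q - v[:, None]                         # A^pi(s, a)
    mu = rho @ resolvent                         # mu_pi(s) = E[sum_t gamma^t I(s_t = s)]
    return rho @ v, adv, mu

# Hypothetical toy problem (not from the lecture).
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))       # shape (S, A, S)
R = rng.normal(size=(S, A))
rho = np.full(S, 1.0 / S)

pi = rng.dirichlet(np.ones(A), size=S)           # current policy
pi_tilde = 0.9 * pi + 0.1 * rng.dirichlet(np.ones(A), size=S)  # nearby new policy

V_pi, adv_pi, mu_pi = policy_quantities(P, R, pi, rho, gamma)
V_tilde, _, mu_tilde = policy_quantities(P, R, pi_tilde, rho, gamma)

exact = V_pi + np.sum(mu_tilde[:, None] * pi_tilde * adv_pi)   # Eq. 1
approx = V_pi + np.sum(mu_pi[:, None] * pi_tilde * adv_pi)     # Eq. 2

print("true V(pi_tilde):", V_tilde)
print("Eq. 1           :", exact)
print("Eq. 2           :", approx)
```

The only difference between the two lines is which state frequencies weight the advantages, which is exactly the substitution described above.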

## 2.2. Trust Region

Disclaimer: I won't cover too much of the theory!

With the local approximation in Eq. 2, we want to ask: if we optimize this surrogate objective, is there a lower bound on the new policy's true performance?

$$
\pi_{new}(a~|~s) = (1 - \alpha) \pi_{old}(a~|~s) + \alpha \pi'(a~|~s)
$$

The policy above is a __mixture policy__ (a blend of 2 policies): it follows the old policy $\pi_{old}$ with weight $1 - \alpha$ and another policy $\pi'$ with weight $\alpha$. For general stochastic policies, not only mixtures, we have the theorem (Eq. 3):

$$
D_{TV}^{max}(\pi_{1}, \pi_{2}) = \underset{s}{\max}~D_{TV}(\pi_{1}(\cdot~|~s), \pi_{2}(\cdot~|~s))\\
D_{KL}^{max}(\pi_{1}, \pi_{2}) = \underset{s}{\max}~D_{KL}(\pi_{1}(\cdot~|~s), \pi_{2}(\cdot~|~s))\\
\epsilon = \underset{s, a}{\max}~|A_{\pi_{old}}(s, a)|\\
D_{TV}(p, q)^{2} \le D_{KL}(p, q)\\
C = \frac{4 \epsilon \gamma}{(1 - \gamma)^{2}}\\
\begin{equation}
    \begin{split}
V^{\pi_{new}} & \ge L_{\pi_{old}}(\pi_{new}) - \frac{4 \epsilon \gamma}{(1 - \gamma)^{2}}(D_{TV}^{max}(\pi_{old}, \pi_{new}))^{2}\\
    & \ge L_{\pi_{old}}(\pi_{new}) - \frac{4 \epsilon \gamma}{(1 - \gamma)^{2}}D_{KL}^{max}(\pi_{old}, \pi_{new}) = M_{i}(\pi_{new})
    \end{split}
\end{equation} \hspace{1em} (Eq.~3)\\
V^{\pi_{i + 1}} - V^{\pi_{i}} \ge M_{i}(\pi_{i + 1}) - M_{i}(\pi_{i}) \hspace{1em} (Eq.~4)\\
$$

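From the theorem (Eq. 3), we can derive Eq. 4: with $\pi_{old} = \pi_{i}$, the bound is tight at the old policy, $M_{i}(\pi_{i}) = V^{\pi_{i}}$, because the divergence term vanishes and $L_{\pi_{i}}(\pi_{i}) = V^{\pi_{i}}$. Eq. 4 simply says that any $\pi_{i + 1}$ that increases $M_{i}$ also increases the true value, so maximizing $M_{i}$ at every iteration gives a monotonically improving general stochastic policy. The theoretical value of $C$ tends to give very small steps, so in practice we treat $C$ as a hyperparameter.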
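The pieces of the theorem are easy to check numerically for discrete action distributions. The sketch below is an illustration under assumed values, not part of the lecture: the helper names, the random policies, and the choices $\epsilon = 1$ and $\gamma = 0.99$ are all hypothetical. It builds a mixture policy, compares $(D_{TV}^{max})^{2}$ with $D_{KL}^{max}$, and evaluates the penalty coefficient $C$.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between matching rows of p and q."""
    return 0.5 * np.sum(np.abs(p - q), axis=-1)

def kl_divergence(p, q):
    """KL divergence D_KL(p, q) per row; assumes strictly positive entries."""
    return np.sum(p * np.log(p / q), axis=-1)

def penalty_coefficient(epsilon, gamma):
    """C = 4 * epsilon * gamma / (1 - gamma)^2, as defined with the theorem."""
    return 4.0 * epsilon * gamma / (1.0 - gamma) ** 2

# Hypothetical policies over 5 states and 3 actions (rows are pi(. | s)).
rng = np.random.default_rng(0)
n_states, n_actions, alpha = 5, 3, 0.2
pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_prime = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_new = (1 - alpha) * pi_old + alpha * pi_prime     # mixture policy

tv_max = total_variation(pi_old, pi_new).max()       # D_TV^max
kl_max = kl_divergence(pi_old, pi_new).max()         # D_KL^max
C = penalty_coefficient(epsilon=1.0, gamma=0.99)     # assumed epsilon and gamma

print("(D_TV^max)^2 :", tv_max ** 2)
print("D_KL^max     :", kl_max)
print("C            :", C)
print("penalty      :", C * kl_max)
```

With $\gamma = 0.99$ the factor $\frac{1}{(1 - \gamma)^{2}}$ alone is $10^{4}$, so even a small divergence is penalized heavily, which is one way to see why the theoretical $C$ is not used directly.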

With the theorem established, we can put it into practice. Let's formulate our objective function as a KL-penalized maximization:

$$
\underset{\theta}{\max}~L_{\pi_{old}}(\pi_{new}) - CD_{KL}^{max}(\pi_{old}, \pi_{new})\\
$$

We can rewrite this as a constrained problem, where the maximum KL divergence over states is replaced by its average over the states visited by the old policy:

$$
\underset{\theta}{\max}~L_{\pi_{old}}(\pi_{new})\\
\text{subject to}~D_{KL}^{s \sim \mu_{\theta_{old}}}(\pi_{old}, \pi_{new}) \le \delta
$$

This formulation leverages not only the previous objective function from the local approximation, but also the new lower bound theorem. The constraint defines a __trust region__: the new policy may only move as far from the old policy as the KL divergence budget $\delta$ allows.

We make 3 substitutions (note $\tilde{\pi} = \pi_{new} = \pi_{\theta}$, and likewise $\pi = \pi_{old} = \pi_{\theta_{old}}$):

1. Substituting $\sum_{s} \mu_{\theta_{old}}(s)$

$$
L_{\pi_{old}} = V(\theta) + \sum_{s} \mu_{\theta_{old}}(s) \sum_{a} \tilde{\pi}(a~|~s) A_{\pi}(s, a)\\
\text{becomes}\\
L_{\pi_{old}} = V(\theta) + \frac{1}{1 - \gamma} \mathbb{E}_{s \sim \mu_{\theta_{old}}} [\sum_{a} \tilde{\pi}(a~|~s) A_{\pi}(s, a)] \hspace{1em} (Eq.~5)\\
$$

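This substitution is made because the state space can be continuous or infinite, so summing over it would be impossible in practice. Instead, we average over states sampled by running the old policy. The factor $\frac{1}{1 - \gamma}$ appears because $\mu_{\theta_{old}}$ sums to $\frac{1}{1 - \gamma}$ and has to be normalized before we can sample from it.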

2. Substituting $\sum_{a} \tilde{\pi}(a~|~s) A_{\pi}(s, a)$

$$
(Eq.~5)\\
\text{becomes}\\
L_{\pi_{old}} = V(\theta) + \frac{1}{1 - \gamma} \mathbb{E}_{s \sim \mu_{\theta_{old}}} [\mathbb{E}_{a \sim q}[\frac{\pi_{\theta}(a~|~s)}{q(a~|~s)}A_{\theta_{old}}(s, a)]] \hspace{1em} (Eq.~6)\\
$$

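Again, the action set can be continuous, so summing over it may be impossible in practice. Instead, we use importance sampling: we draw actions from a sampling distribution $q$ and reweight each one by $\frac{\pi_{\theta}(a~|~s)}{q(a~|~s)}$.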

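3. Substituting $A_{\theta_{old}}$ with $Q_{\theta_{old}}$

Since $A_{\theta_{old}}(s, a) = Q_{\theta_{old}}(s, a) - V_{\theta_{old}}(s)$ and $\mathbb{E}_{a \sim q}[\frac{\pi_{\theta}(a~|~s)}{q(a~|~s)}] = 1$, swapping $A$ for $Q$ only shifts the objective by a term that does not depend on $\theta$. The constants $V(\theta)$ (the old policy's value) and $\frac{1}{1 - \gamma}$ do not change the maximizer either, so we drop them.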

$$
(Eq.~6)\\
\text{becomes}\\
\underset{\theta}{\max}~\mathbb{E}_{s \sim \mu_{\theta_{old}}, a \sim q} [\frac{\pi_{\theta}(a~|~s)}{q(a~|~s)}Q_{\theta_{old}}(s, a)] \hspace{1em} (Eq.~7)\\
\text{subject to}~\mathbb{E}_{s \sim \mu_{\theta_{old}}} [D_{KL}(\pi_{old}(\cdot~|~s), \pi_{new}(\cdot~|~s))] \le \delta
$$
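As a rough illustration of how Eq. 7 and its constraint can be estimated from samples, here is a minimal sketch for a tabular policy. Several things in it are assumptions of mine and not from the lecture: the function names, the placeholder table `Q_hat` standing in for estimates of $Q_{\theta_{old}}$, the choice $q = \pi_{old}$, and states drawn uniformly as a stand-in for $s \sim \mu_{\theta_{old}}$ (in practice they would come from rollouts of the old policy).

```python
import numpy as np

def surrogate_objective(pi_new, q_probs, states, actions, q_values):
    """Sample average of (pi_new(a|s) / q(a|s)) * Q(s, a), the objective in Eq. 7."""
    ratios = pi_new[states, actions] / q_probs[states, actions]
    return np.mean(ratios * q_values)

def mean_kl(pi_old, pi_new, states):
    """Sample average over states of D_KL(pi_old(.|s), pi_new(.|s))."""
    p, q = pi_old[states], pi_new[states]
    return np.mean(np.sum(p * np.log(p / q), axis=1))

# Hypothetical setup: 4 states, 3 actions, 1000 sampled (s, a) pairs.
rng = np.random.default_rng(0)
n_states, n_actions, n_samples = 4, 3, 1000
pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_new = 0.9 * pi_old + 0.1 * rng.dirichlet(np.ones(n_actions), size=n_states)
Q_hat = rng.normal(size=(n_states, n_actions))     # placeholder Q estimates

states = rng.integers(n_states, size=n_samples)    # stand-in for s ~ mu_theta_old
actions = np.array([rng.choice(n_actions, p=pi_old[s]) for s in states])  # a ~ q
q_values = Q_hat[states, actions]

delta = 0.01                                       # assumed trust-region size
objective = surrogate_objective(pi_new, pi_old, states, actions, q_values)
constraint = mean_kl(pi_old, pi_new, states)

print("surrogate objective:", objective)
print("mean KL            :", constraint)
print("inside trust region:", constraint <= delta)
```

When `pi_new` equals `pi_old`, every importance ratio is 1 and the mean KL is 0, so the estimate reduces to the plain average of the sampled Q values.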

## 2.3. TRPO

for iteration = 1,2, ... do <br>
$\quad$ Run policy for $T$ timesteps or $N$ trajectories <br>
$\quad$ Estimate advantage function at all time steps <br>
$\quad$ Compute policy gradient $g$ <br>
$\quad$ Use conjugate gradient (CG) to compute $F^{-1} g$, where $F$ is the Fisher information matrix <br>
$\quad$ Do line search on surrogate loss and KL constraint <br>

_Algorithm 1. Trust Region Policy Optimization (TRPO)._

Algorithm 1 is just a brief look into TRPO! 
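The last two steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not a full TRPO implementation: `fvp`, `surrogate` and `kl` are hypothetical callables that a real implementation would build from the policy network and sampled data, and the toy problem at the bottom uses an explicit matrix `F` with a linear surrogate purely so the code runs on its own.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products v -> F v (F never inverted)."""
    x = np.zeros_like(g)
    r = g.copy()                  # residual g - F x, with x = 0 initially
    p = g.copy()                  # current search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        step = rs_old / (p @ Fp)
        x += step * p
        r -= step * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta, g, fvp, surrogate, kl, delta, backtrack=0.5, max_backtracks=10):
    """One update: CG direction, then backtracking line search on surrogate and KL."""
    direction = conjugate_gradient(fvp, g)
    # Largest step along the direction under the quadratic model 0.5 * s^T F s <= delta.
    scale = np.sqrt(2.0 * delta / (direction @ fvp(direction)))
    full_step = scale * direction
    old_value = surrogate(theta)
    for k in range(max_backtracks):
        candidate = theta + (backtrack ** k) * full_step
        if kl(candidate) <= delta and surrogate(candidate) > old_value:
            return candidate
    return theta                  # no acceptable step found; keep the old parameters

# Hypothetical toy problem standing in for a real policy.
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
F = M.T @ M + np.eye(3)           # symmetric positive definite "Fisher" matrix
g = rng.normal(size=3)            # stand-in policy gradient
theta0 = np.zeros(3)
delta = 0.01

fvp = lambda v: F @ v
surrogate = lambda th: g @ (th - theta0)
kl = lambda th: 0.5 * (th - theta0) @ F @ (th - theta0)

x = conjugate_gradient(fvp, g)
theta1 = trpo_step(theta0, g, fvp, surrogate, kl, delta)
print("CG residual norm:", np.linalg.norm(F @ x - g))
print("new parameters  :", theta1)
print("KL after step   :", kl(theta1))
```

CG only needs matrix-vector products with $F$, which is why the algorithm can work with $F^{-1} g$ without ever forming or inverting $F$ explicitly.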

# 3. Resource

If you missed the link right below the title, I'm providing the resource here again along with the course website.

- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)
- [Course Website](http://web.stanford.edu/class/cs234/index.html)

This is a series of 15 lectures provided by Stanford.
