# Lecture 8 - Policy Gradient I

provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)

---

<div class="alert alert-block alert-info">
Table of Contents: <br>
    
<ul>
    <li>1. <a href="#1.-Introduction">Introduction</a></li>
    <li>2. <a href="#2.-Policy-Optimization">Policy Optimization</a></li>
    <li>3. <a href="#3.-Resource">Resource</a></li>
</ul>
</div>

# 1. Introduction

In the previous lecture, we tried to approximate the value or action-value functions using parameters $\theta$:

$$
V_{\theta}(s) \approx V^{\pi}(s)\\
Q_{\theta}(s, a) \approx Q^{\pi}(s, a)
$$

The policy was then derived from these value functions, usually by acting $\epsilon$-greedily with respect to them. Today we will directly parameterize the policy:

$$
\pi_{\theta}(s, a) = \mathbb{P}[a~|~s; \theta] \hspace{1em} (Eq.~1)\\
$$
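One common way to realize Eq. 1 for a discrete action set is a softmax over linear action preferences. The sketch below is only an illustration: the feature vectors $\phi(s, a)$, the function name `softmax_policy`, and the numbers are assumptions and do not come from the lecture.

```python
import math

def softmax_policy(theta, features):
    """Return pi_theta(s, .) for one state as a list of action probabilities.

    theta:    list of n policy parameters
    features: one hypothetical feature vector phi(s, a) of length n per action
    """
    # Action preferences: the dot product theta . phi(s, a) for every action.
    prefs = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    # Subtract the largest preference before exponentiating (numerical stability).
    largest = max(prefs)
    exps = [math.exp(p - largest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# Two actions with one-hot features (made-up numbers).
theta = [0.5, -0.25]
features = [[1.0, 0.0], [0.0, 1.0]]
probs = softmax_policy(theta, features)
```

Because the probabilities are a smooth function of $\theta$, a policy of this form can be improved with the gradient-based methods discussed later.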

* Value Based
    * learnt value function
    * implicit policy (e.g. $\epsilon$-greedy)
* Policy Based
    * no value function
    * learnt policy
* Actor-Critic
    * learnt value function
    * learnt policy

Advantages of policy-based RL:
* better convergence properties
* effective in high-dimensional or continuous action spaces

Disadvantages:
* typically converges to a local rather than a global optimum
* evaluating a policy is typically inefficient and high-variance

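* Goal: given a policy $\pi_{\theta}(s, a)$ with parameters $\theta$, find the best $\theta$.
* To do that, we need a measure of the quality of the policy (policy evaluation).
* Which measure we use depends on the environment: the start value in __episodic environments__ (Eq. 2), and the average value (Eq. 3) or the average reward per time-step (Eq. 4) in continuing environments: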

$$
J_{1}(\theta) = V^{\pi_{\theta}}(s_{1}) \hspace{1em} (Eq.~2)\\
J_{avV}(\theta) = \sum_{s} d^{\pi_{\theta}}(s) V^{\pi_{\theta}}(s) \hspace{1em} (Eq.~3)\\
J_{avR}(\theta) = \sum_{s} d^{\pi_{\theta}}(s) \sum_{a} \pi_{\theta}(s, a) R(s, a) \hspace{1em} (Eq.~4)\\
$$

$d^{\pi_{\theta}}(s)$ is the stationary distribution of states under $\pi_{\theta}$.

Eq. 2 applies to episodic environments, where we can use the value of the start state $s_{1}$. Eq. 3 and Eq. 4 apply to continuing environments: Eq. 3 is the average value over states, and Eq. 4 is the average reward per time-step.
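To make Eq. 4 concrete, here is a minimal sketch that computes $J_{avR}$ for a tiny tabular problem. The two-state MDP, the policy, and all function names are made up for illustration. The stationary distribution is found by power iteration, which assumes the state chain induced by the policy has a unique stationary distribution and is aperiodic.

```python
def policy_transition_matrix(pi, P):
    """State-to-state transition probabilities when acting with pi."""
    n_s, n_a = len(pi), len(pi[0])
    return [[sum(pi[s][a] * P[s][a][s2] for a in range(n_a))
             for s2 in range(n_s)] for s in range(n_s)]

def stationary_distribution(P_pi, iters=1000):
    """Power iteration: repeatedly push a distribution through P_pi."""
    n = len(P_pi)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[s] * P_pi[s][s2] for s in range(n)) for s2 in range(n)]
    return d

def average_reward(pi, P, R):
    """J_avR = sum_s d(s) * sum_a pi(s, a) * R(s, a), as in Eq. 4."""
    d = stationary_distribution(policy_transition_matrix(pi, P))
    return sum(d[s] * sum(pi[s][a] * R[s][a] for a in range(len(pi[s])))
               for s in range(len(pi)))

# Hypothetical MDP: action 0 stays in the current state, action 1 switches.
P = [[[1.0, 0.0], [0.0, 1.0]],   # P[s][a][s'] for s = 0
     [[0.0, 1.0], [1.0, 0.0]]]   # P[s][a][s'] for s = 1
R = [[0.0, 0.0], [1.0, 1.0]]     # reward 1 for every step taken from state 1
pi = [[0.2, 0.8], [0.9, 0.1]]    # mostly leave state 0, mostly stay in state 1

d = stationary_distribution(policy_transition_matrix(pi, P))
J = average_reward(pi, P, R)
```

For this policy the chain spends 8/9 of its time in state 1, so the average reward per time-step is 8/9.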

# 2. Policy Optimization

Policy-based RL is an optimization problem: find the $\theta$ that maximizes the chosen objective. (So far we have been doing model-based and model-free value-based RL.) Some optimization methods do not need a gradient:
* hill climbing
* genetic algorithms

Gradient-free optimization methods are good baselines, but they tend to be sample-inefficient.
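As a rough illustration of the gradient-free idea, the sketch below implements a simple random-perturbation hill climber. The objective `toy_objective` is a hypothetical stand-in for $V(\theta)$; in an RL setting each evaluation would instead require running the policy for one or more episodes, which is why such methods can need many samples.

```python
import random

def hill_climb(V, theta, n_iters=2000, noise=0.1, seed=0):
    """Keep a random perturbation of theta only if it improves V."""
    rng = random.Random(seed)
    best, best_value = list(theta), V(theta)
    for _ in range(n_iters):
        # Propose a nearby parameter vector.
        candidate = [t + rng.gauss(0.0, noise) for t in best]
        value = V(candidate)
        if value > best_value:
            best, best_value = candidate, value
    return best, best_value

def toy_objective(theta):
    # Hypothetical stand-in for V(theta), maximized at theta = (1, -2).
    return -((theta[0] - 1.0) ** 2) - (theta[1] + 2.0) ** 2

theta_star, value_star = hill_climb(toy_objective, [0.0, 0.0])
```

The search never uses a derivative: it only compares objective values, so it also works when the policy is not differentiable.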

Policy gradient algorithms search for a _local_ maximum in $V(\theta)$ by ascending the gradient of the policy value w.r.t. the parameters $\theta$. There are many different PG algorithms.

$$
V(\theta) = V^{\pi_{\theta}}\\
\Delta \theta = \alpha \nabla_{\theta}V(\theta)\\
\nabla_{\theta}V(\theta) = \begin{pmatrix}
    \frac{\partial V(\theta)}{\partial \theta_{1}} \\
    \vdots \\
    \frac{\partial V(\theta)}{\partial \theta_{n}}
\end{pmatrix}
$$

$\alpha$ is a step-size parameter.

__PG by Finite Differences__ is simple, noisy, and inefficient, but it sometimes works well, and it applies to arbitrary policies, even ones that are not differentiable.

To evaluate the policy gradient of $\pi_{\theta}(s, a)$: <br>
For each dimension $k \in [1, n]$: <br>
$\quad$ Perturb $\theta$ by a small amount $\epsilon$ in the $k$-th dimension <br>
$\quad$ Estimate the $k$-th partial derivative of the objective function w.r.t. $\theta$ from the resulting change in value

$$
\frac{\partial V(\theta)}{\partial \theta_{k}} \approx \frac{V(\theta + \epsilon u_{k}) - V(\theta)}{\epsilon}
$$

$u_{k}$ is a unit vector with 1 in $k$-th component, 0 elsewhere. <br>

_Algorithm 1. PG by Finite Differences._
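Algorithm 1 can be sketched in a few lines. The function below takes any callable `V` that returns the objective for a parameter vector. In RL that value would be estimated from rollouts and would therefore be noisy; the deterministic `toy_objective` used here is a hypothetical stand-in so the result can be checked by hand.

```python
def finite_difference_gradient(V, theta, eps=1e-5):
    """Estimate grad V(theta) with one forward difference per dimension."""
    base = V(theta)
    grad = []
    for k in range(len(theta)):
        # theta + eps * u_k: perturb only the k-th component.
        perturbed = list(theta)
        perturbed[k] += eps
        grad.append((V(perturbed) - base) / eps)
    return grad

def toy_objective(theta):
    # Hypothetical stand-in for V(theta); its true gradient at (0, 0) is (2, -4).
    return -((theta[0] - 1.0) ** 2) - (theta[1] + 2.0) ** 2

grad = finite_difference_gradient(toy_objective, [0.0, 0.0])
```

Note the cost: one extra evaluation of $V$ per parameter, so $n + 1$ evaluations for a single gradient estimate.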

__Likelihood Ratio Policies__

Define a state-action trajectory: $\tau = (s_{0}, a_{0}, r_{0}, ..., s_{T - 1}, a_{T - 1}, r_{T - 1}, s_{T})$ <br>
Let $R(\tau) = \sum_{t=0}^{T - 1}R(s_{t}, a_{t})$ be the sum of rewards for a trajectory $\tau$. <br>
The policy value is:

$$
V(\theta) = \sum_{\tau} P(\tau; \theta) R(\tau) \hspace{1em} (Eq.~5)\\
\underset{\theta}{\arg\max}~V(\theta) = \underset{\theta}{\arg\max} \sum_{\tau} P(\tau; \theta)R(\tau) \hspace{1em} (Eq.~6)\\
\begin{equation}
    \begin{split}
    \nabla_{\theta}V(\theta) & = \sum_{\tau} P(\tau; \theta) R(\tau) \nabla_{\theta} \log P(\tau;\theta)\\
    & = \mathbb{E}_{\tau}\left[\sum_{t = 0}^{T - 1} \nabla_{\theta} \log \pi_{\theta}(a_{t}~|~s_{t}) G_{t}\right]\\
    & \approx \hat{g} = \frac{1}{m} \sum_{i = 1}^{m} R(\tau^{(i)}) \nabla_{\theta} \log P(\tau^{(i)};\theta)\\
    & = \frac{1}{m} \sum_{i = 1}^{m} R(\tau^{(i)}) \sum_{t = 0}^{T - 1} \nabla_{\theta} \log \pi_{\theta}(a_{t}^{(i)}~|~s_{t}^{(i)})\\
    & \approx \frac{1}{m} \sum_{i = 1}^{m} \sum_{t = 0}^{T - 1} \nabla_{\theta} \log \pi_{\theta}(a_{t}^{(i)}~|~s_{t}^{(i)}) G_{t}^{(i)}\\
    \end{split}
\end{equation} \hspace{1em} (Eq.~7)\\
$$

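$P(\tau; \theta)$ is the probability of trajectory $\tau$ when executing policy $\pi_{\theta}$, $\tau^{(i)}$ is the $i$-th of $m$ sampled trajectories, and $G_{t}^{(i)} = \sum_{t' = t}^{T - 1} r_{t'}^{(i)}$ is the return collected from step $t$ onward in that trajectory.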
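The first line of Eq. 7 follows from Eq. 5 through the likelihood ratio identity $\nabla_{\theta} P = P \, \nabla_{\theta} \log P$. A short derivation is given here, writing $\mu(s_{0})$ for the start-state distribution and $P(s_{t+1}~|~s_{t}, a_{t})$ for the environment dynamics (both symbols are notation assumed for this sketch):

$$
\begin{aligned}
\nabla_{\theta} V(\theta) &= \sum_{\tau} \nabla_{\theta} P(\tau; \theta) R(\tau)
 = \sum_{\tau} P(\tau; \theta) \frac{\nabla_{\theta} P(\tau; \theta)}{P(\tau; \theta)} R(\tau)
 = \sum_{\tau} P(\tau; \theta) R(\tau) \nabla_{\theta} \log P(\tau; \theta)\\
\log P(\tau; \theta) &= \log \mu(s_{0}) + \sum_{t = 0}^{T - 1} \left[ \log \pi_{\theta}(a_{t}~|~s_{t}) + \log P(s_{t+1}~|~s_{t}, a_{t}) \right]\\
\nabla_{\theta} \log P(\tau; \theta) &= \sum_{t = 0}^{T - 1} \nabla_{\theta} \log \pi_{\theta}(a_{t}~|~s_{t})
\end{aligned}
$$

Only the policy terms depend on $\theta$, so the start-state distribution and the dynamics drop out of the gradient. This is why the estimate can be computed from sampled trajectories without a model of the environment.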

In likelihood ratio methods, the policy value is written as an expectation over trajectories, as in Eq. 5. Eq. 6 states the optimization problem: find the $\theta$ that maximizes this value. Eq. 7 is the gradient of the value function w.r.t. the policy parameters $\theta$, together with its sample-based estimate $\hat{g}$. Notice how policy gradient algorithms optimize the policy directly (in this case by following gradients). The last two lines of Eq. 7 are the most important, as they are the crux of REINFORCE, a classic policy gradient algorithm. The final line weights each log-probability gradient by the return $G_{t}$ from that step onward instead of by the whole-trajectory return $R(\tau)$.

Note: $\log \pi(a_{t}~|~s_{t}; \theta)$ is the same as $\log \pi_{\theta}(a_{t}~|~s_{t})$.
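The last line of Eq. 7 can be sketched for a tabular softmax policy as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: `theta[s][a]` holds one action preference per state-action pair, returns are undiscounted, and the two trajectories are made-up data. For this parameterization, $\partial \log \pi_{\theta}(a~|~s) / \partial \theta_{s,b} = \mathbb{1}[a = b] - \pi_{\theta}(b~|~s)$.

```python
import math

def softmax(prefs):
    largest = max(prefs)
    exps = [math.exp(p - largest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reward_to_go(rewards):
    """G_t = r_t + r_{t+1} + ... + r_{T-1} (undiscounted)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

def reinforce_gradient(theta, trajectories):
    """Monte Carlo estimate of grad V(theta) from sampled trajectories.

    Each trajectory is a list of (state, action, reward) triples.
    """
    grad = [[0.0] * len(row) for row in theta]
    for traj in trajectories:
        returns = reward_to_go([r for _, _, r in traj])
        for (s, a, _), g in zip(traj, returns):
            probs = softmax(theta[s])
            for b in range(len(probs)):
                # Gradient of log pi(a | s) w.r.t. theta[s][b], weighted by G_t.
                grad[s][b] += ((1.0 if a == b else 0.0) - probs[b]) * g
    m = len(trajectories)
    return [[g / m for g in row] for row in grad]

theta = [[0.0, 0.0], [0.0, 0.0]]      # uniform policy in both states
trajectories = [
    [(0, 1, 0.0), (1, 0, 1.0)],       # hypothetical episode with return 1
    [(0, 0, 0.0), (1, 1, 0.0)],       # hypothetical episode with return 0
]
g_hat = reinforce_gradient(theta, trajectories)
```

A gradient ascent step then moves the parameters along the estimate, $\theta \leftarrow \theta + \alpha \hat{g}$, which here raises the preference for the actions taken in the rewarded episode.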

The rest of the lecture highlights the REINFORCE algorithm, one common policy gradient method.

For an implementation of the algorithm I'd recommend this: https://github.com/ageron/handson-ml2/blob/master/18_reinforcement_learning.ipynb.

For understanding it, I recommend this: https://medium.com/intro-to-artificial-intelligence/reinforce-a-policy-gradient-based-reinforcement-learning-algorithm-84bde440c816.

# 3. Resource

If you missed the link right below the title, I'm providing the resource here again along with the course website.

- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)
- [Course Website](http://web.stanford.edu/class/cs234/index.html)

This is a series of 15 lectures provided by Stanford.
