# Introduction to Policy Gradients

This post will introduce the core concepts underlying various **policy gradient algorithms**. These algorithms are an approach to reinforcement learning utilizing **stochastic gradient ascent**.

## The Goal

Those of you familiar with neural networks will probably have heard of stochastic gradient descent. Its goal is to calculate the gradient of a loss function and then adjust the parameters of the network to minimize that loss by stepping in the opposite direction of the gradient.

We can utilize the very similar stochastic gradient ascent to approach reinforcement learning problems. To do this, we need a reward function $J(\theta)$ which tells us how well a given policy $\pi_\theta$ with parameters $\theta$ performs. In reinforcement learning this is actually quite simple, since reinforcement learning inherently utilizes rewards $r$. Thus the reward function is chosen to be the expected return of a trajectory $\tau$ generated by the current policy $\pi_\theta$. Let $G(\tau)$ be the infinite-horizon discounted return of the trajectory $\tau$, starting at its first timestep and discounted by a factor $\gamma$.

The derivations of the policy gradient for the finite-horizon or undiscounted return are almost identical. For a finite horizon $T$, simply replace every $\infty$ with $T$.

\begin{align}
    G(\tau) &= \sum_{t = 1}^\infty \gamma^{t-1} r_{t+1} \\
    J(\theta) &= \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} [ G(\tau) ] \\
\end{align}

If we can now calculate the gradient of the reward function $J(\theta)$, we can take stochastic gradient ascent steps, scaled by a step-size hyperparameter $\alpha$, to gradually maximize the reward function over many steps:

\begin{align}
    \theta = \theta + \alpha \nabla_\theta J(\theta)
\end{align}
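
As a minimal sketch of this update rule, the snippet below runs plain gradient ascent on a toy objective. The objective $J(\theta) = -(\theta - 3)^2$, the step size and the number of steps are assumptions chosen purely for illustration; in reinforcement learning the gradient would have to be estimated from sampled trajectories instead.

```python
def gradient_ascent(grad_fn, theta, alpha=0.1, steps=100):
    """Repeatedly step in the direction of the gradient to increase the objective."""
    for _ in range(steps):
        theta = theta + alpha * grad_fn(theta)
    return theta


def toy_gradient(theta):
    """Gradient of the hypothetical objective J(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0)


# Starting from theta = 0, the iterates move towards the maximiser theta = 3.
theta_star = gradient_ascent(toy_gradient, theta=0.0)
```

With a larger $\alpha$ the iterates can overshoot the maximum, which is why the step size is treated as a hyperparameter.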

The only remaining problem is to determine what the gradient $\nabla_\theta J(\theta)$ actually looks like.

## How to calculate the Gradient

##### Step 1:

We know that if $X$ is a discrete random variable and $P(x)$ is the probability that $X = x$, we can calculate $\mathbb{E}[X]$ as $\mathbb{E}[X] = \sum_x P(x) x$, that is, by summing over all possible values $x$ of $X$, each multiplied by its probability.

We can do something similar for $J(\theta)$ by summing over all possible trajectories $\tau$, using the probability $P(\tau | \theta)$ that $\tau$ occurs under the policy $\pi_\theta$ determined by the parameters $\theta$.

\begin{align}
    J(\theta) &= \sum_\tau P(\tau | \theta) G(\tau) \\
    \nabla_\theta J(\theta) &= \nabla_\theta \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} [ G(\tau) ] \\
    &= \nabla_\theta \sum_\tau P(\tau | \theta) G(\tau) \\
    &= \sum_\tau \nabla_\theta P(\tau | \theta) G(\tau) \\
\end{align}

##### Step 2:

Given a function $f(x)$ the log-derivative trick is useful for rewriting the derivative $\frac{d}{dx} f(x)$ and relies upon the derivative of $\log(x)$ being $\frac{1}{x}$ and the chain rule.

\begin{align}
    \frac{d}{dx} f(x) &= f(x) \frac{1}{f(x)} \frac{d}{dx} f(x) \\
    &= f(x)\frac{d}{dx} \log(f(x)) \\
\end{align}
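
The identity can be checked numerically. The sketch below compares both sides using central finite differences; the function $f(x) = x^2 + 1$ and the evaluation point are hypothetical choices, picked so that $f(x) > 0$ and the logarithm is defined.

```python
import math


def numerical_derivative(fn, x, eps=1e-6):
    """Central finite-difference approximation of d/dx fn(x)."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


def f(x):
    """Hypothetical strictly positive function, so that log(f(x)) exists."""
    return x ** 2 + 1.0


x = 1.5
# Left-hand side: the derivative of f itself.
lhs = numerical_derivative(f, x)
# Right-hand side: f(x) times the derivative of log(f(x)).
rhs = f(x) * numerical_derivative(lambda z: math.log(f(z)), x)
```

Both `lhs` and `rhs` approximate the analytic derivative $2x = 3$ at $x = 1.5$.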

Since $G(\tau)$ does not depend on $\theta$, we can apply this trick to $P(\tau | \theta)$:

\begin{align}
    \nabla_\theta J(\theta) &= \sum_\tau \nabla_\theta P(\tau | \theta) G(\tau) \\
    &= \sum_\tau P(\tau | \theta) \nabla_\theta \log(P(\tau | \theta)) G(\tau) \\
\end{align}

##### Step 3:

An expression for $P(\tau | \theta)$ is still required, but it can be found by treating the problem as a Markov decision process (MDP). Let $p(s_1)$ be the probability of starting a trajectory in state $s_1$; then $P(\tau | \theta)$ can be expressed as the product of the probabilities of every action choice $a_t$ and every state transition to $s_{t+1}$ along the trajectory.

\begin{align}
    P(\tau | \theta) &= p(s_1) \prod_{t = 1}^\infty P(s_{t+1} | s_t, a_t) \pi_\theta(a_t | s_t) \\
    \log(P(\tau | \theta)) &= \log(p(s_1)) + \sum_{t = 1}^\infty \big( \log(P(s_{t+1} | s_t, a_t)) + \log(\pi_\theta(a_t | s_t)) \big) \\
\end{align}
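
To make the product-to-sum step concrete, the following sketch evaluates both forms for one short trajectory. All probabilities are made-up numbers for a hypothetical two-step trajectory; they are not derived from any particular environment.

```python
import math

# Hypothetical probabilities along one short trajectory (two steps).
p_start = 0.5                  # p(s_1)
transition_probs = [0.8, 0.6]  # P(s_{t+1} | s_t, a_t) for t = 1, 2
policy_probs = [0.7, 0.9]      # pi_theta(a_t | s_t) for t = 1, 2

# Product form: multiply the probability of every action and transition.
prob = p_start
for p_trans, p_act in zip(transition_probs, policy_probs):
    prob *= p_trans * p_act

# Log form: the product turns into a sum of logarithms.
log_prob = math.log(p_start) + sum(
    math.log(p_trans) + math.log(p_act)
    for p_trans, p_act in zip(transition_probs, policy_probs)
)
```

Exponentiating `log_prob` recovers `prob` up to floating-point error. Only the `policy_probs` terms would change if $\theta$ changed.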

##### Step 4:

Notice that $p(s_1)$ and $P(s_{t+1} | s_t, a_t)$ also do not depend on $\theta$, so their gradients vanish and only the policy terms remain.

\begin{align}
    \nabla_\theta \log(P(\tau | \theta)) &= \sum_{t = 1}^\infty \nabla_\theta \log(\pi_\theta(a_t | s_t)) \\
\end{align}

\begin{align}
    \nabla_\theta J(\theta) &= \sum_\tau P(\tau | \theta) \sum_{t = 1}^\infty \nabla_\theta \log(\pi_\theta(a_t | s_t)) G(\tau) \\
\end{align}

Now we can reverse step 1, leaving us with the finished equation, which can be estimated by sampling multiple trajectories and taking the mean of the gradients over the sampled trajectories.

\begin{align}
    \nabla_\theta J(\theta) &= \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \left[ \sum_{t = 1}^\infty \nabla_\theta \log(\pi_\theta(a_t | s_t)) G(\tau) \right] \\
\end{align}
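
The sampling-based estimate can be sketched in a few lines of plain Python. Everything concrete here is an assumption for illustration: a single-state environment with two actions, a softmax policy whose parameters $\theta$ are the two logits, a finite horizon of 3 steps, per-action rewards of 1 and 0, and a discount convention in which the first reward is undiscounted. For such a small problem the exact gradient is available in closed form, so the estimate can be compared against it.

```python
import math
import random


def softmax(logits):
    """Turn logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def sample_action(probs, rng):
    """Draw an action index according to the given probabilities."""
    u, cumulative = rng.random(), 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return action
    return len(probs) - 1


def estimate_policy_gradient(logits, rewards, horizon, gamma, n_trajectories, rng):
    """Average (sum_t grad log pi(a_t)) * G(tau) over sampled trajectories."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_trajectories):
        score = [0.0] * len(logits)  # sum over t of grad log pi(a_t)
        ret = 0.0                    # discounted return G(tau)
        for t in range(horizon):
            a = sample_action(probs, rng)
            ret += gamma ** t * rewards[a]
            step_grad = grad_log_softmax(logits, a)
            score = [s + g for s, g in zip(score, step_grad)]
        grad = [acc + s * ret for acc, s in zip(grad, score)]
    return [g / n_trajectories for g in grad]


def exact_policy_gradient(logits, rewards, horizon, gamma):
    """Closed-form gradient for this single-state toy problem."""
    probs = softmax(logits)
    mean_reward = sum(p * r for p, r in zip(probs, rewards))
    discount_sum = sum(gamma ** t for t in range(horizon))
    return [discount_sum * p * (r - mean_reward) for p, r in zip(probs, rewards)]


rng = random.Random(0)
logits = [0.0, 0.0]    # theta: one logit per action (hypothetical)
rewards = [1.0, 0.0]   # hypothetical reward for each action
estimate = estimate_policy_gradient(logits, rewards, horizon=3, gamma=0.9,
                                    n_trajectories=20000, rng=rng)
exact = exact_policy_gradient(logits, rewards, horizon=3, gamma=0.9)
```

The estimate is noisy for small sample sizes and approaches the exact gradient as the number of trajectories grows.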



## Alternative expressions for policy gradient


The general form of the policy gradient for the finite-horizon, undiscounted return is:
$$\nabla_\theta J(\pi_\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \big[\sum^T_{t=0}\nabla_\theta \log \pi_\theta(a_t|s_t)R(\tau) \big]$$
All components of this expression were already defined in the derivation above, albeit for the infinite horizon; here $R(\tau)$ denotes the undiscounted return of the trajectory and takes the place of $G(\tau)$, and the time index starts at $t = 0$. To obtain the alternative forms of this expression, we replace $R(\tau)$ with a more general weight $\Phi_\tau$:
$$ \nabla_\theta J(\pi_\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \big[\sum^T_{t=0}\nabla_\theta \log \pi_\theta(a_t|s_t)\Phi_\tau \big] $$ <br>


We can now choose different expressions for $\Phi_\tau$ that all yield the same expected gradient. For the sake of completeness, the first choice is the one we started with: $\Phi_\tau = R(\tau)$. Furthermore, $R(\tau)$ can be replaced by the reward-to-go $\sum^T_{t'=t}R(s_{t'},a_{t'},s_{t'+1})$. The reason is that $R(\tau)$ sums all rewards obtained along the trajectory, but rewards received before an action was taken should not influence how strongly that action is reinforced. Consequently, it is only sensible to consider the rewards that come after the action being reinforced: $\Phi_\tau = \sum^T_{t'=t}R(s_{t'},a_{t'},s_{t'+1})$.  <br>
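
The reward-to-go for every timestep of an episode can be computed in a single backward pass. This is a minimal sketch; the function name `rewards_to_go` and the example rewards are illustrative assumptions.

```python
def rewards_to_go(rewards):
    """For each timestep t, sum the rewards from t to the end of the episode."""
    result = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum of the rewards after it.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        result[t] = running
    return result


# Hypothetical rewards of a three-step episode.
example = rewards_to_go([1.0, 0.0, 2.0])  # [3.0, 2.0, 2.0]
```

The first entry equals the full return $R(\tau)$, while later entries ignore the rewards collected before their timestep.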

Another alteration is the use of __baselines__ in the expression. Baselines are used to address the _high variance problem_, which is further explained in the last section. For now it suffices to know that a baseline $b(s_t)$, which may depend on the state but not on the action (for example an estimate of the average return from that state), is subtracted from the reward-to-go to reduce the variance of the gradient estimate:
$$\Phi_\tau = \sum^T_{t'=t}R(s_{t'},a_{t'},s_{t'+1})-b(s_t)$$
<br>
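
The effect of a baseline can be seen in a tiny example that is small enough to enumerate exactly instead of sampling. The setup is an assumption for illustration: one state, two actions chosen with probability 0.5 each by a softmax policy, rewards of 10 and 11, and a constant baseline of 10.5. The code computes the mean and variance of the single-sample gradient with respect to the first logit, with and without the baseline.

```python
def gradient_samples(probs, rewards, baseline):
    """Value of d/d(logit_0) log pi(a) * (r(a) - baseline) for each action a."""
    return [((1.0 if a == 0 else 0.0) - probs[0]) * (rewards[a] - baseline)
            for a in range(len(probs))]


def mean_and_variance(values, probs):
    """Exact mean and variance of a discrete random variable."""
    mean = sum(p * v for p, v in zip(probs, values))
    variance = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, variance


probs = [0.5, 0.5]       # hypothetical action probabilities
rewards = [10.0, 11.0]   # hypothetical reward for each action

mean_plain, var_plain = mean_and_variance(
    gradient_samples(probs, rewards, baseline=0.0), probs)
mean_base, var_base = mean_and_variance(
    gradient_samples(probs, rewards, baseline=10.5), probs)
```

Both versions have the same mean, so the baseline does not bias the gradient, but the variance is much smaller with the baseline.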

The on-policy action-value function $ Q^\pi(s,a) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \big[R(\tau)|s_0=s, a_0=a\big] $, which gives the expected return when starting in state $s$ and taking action $a$, can also be used as $\Phi$. This is [proven](https://spinningup.openai.com/en/latest/spinningup/extra_pg_proof2.html) using the law of iterated expectations, and the result is:
$$\nabla_\theta J(\pi_\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \big[\sum^T_{t=0}\nabla_\theta \log \pi_\theta(a_t|s_t) Q^{\pi_\theta}(s_t,a_t) \big] $$ <br>



The advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$, which measures how much better an action is than the policy's average action in that state, can be justified in the same way as the action-value function and can also be inserted into the policy gradient expression:
$$\nabla_\theta J(\pi_\theta) = \mathop{\mathbb{E}}_{\tau \sim \pi_\theta} \big[\sum^T_{t=0}\nabla_\theta \log \pi_\theta(a_t|s_t) A^{\pi_\theta}(s_t,a_t) \big] $$
<br>
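
As a small worked example of the advantage, the sketch below computes $A^\pi(s,a)$ for a single state, using the fact that $V^\pi(s)$ is the policy-weighted average of $Q^\pi(s,a)$. The action values and action probabilities are hypothetical numbers.

```python
def advantages(q_values, probs):
    """A(s, a) = Q(s, a) - V(s), with V(s) = sum_a pi(a | s) * Q(s, a)."""
    v = sum(p * q for p, q in zip(probs, q_values))
    return [q - v for q in q_values]


# Hypothetical action values and policy probabilities for one state.
q_values = [1.0, 2.0, 4.0]
probs = [0.5, 0.25, 0.25]

adv = advantages(q_values, probs)
# Averaged under the policy, the advantages cancel out.
expected_advantage = sum(p * a for p, a in zip(probs, adv))
```

A positive advantage marks an action that is better than the policy's average in that state, a negative one marks a worse action.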

All of these expressions have one thing in common: they have the __same__ expected value, the policy gradient, even though they differ in form.



## Simple policy gradient algorithm

REINFORCE or something similarly basic

## More advanced concepts?

Baselines

TRPO

PPO

A2C

## References

- [Lilian Weng: Policy Gradient Algorithms](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html)

- [OpenAI: Spinning Up](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#optional-formalism)
