# Natural Policy Gradient (NPG)

Sources:
*   Natural Policy Gradients In Reinforcement Learning Explained https://arxiv.org/pdf/2209.01820
*   Deep RL Bootcamp Lecture 5: Natural Policy Gradients, TRPO, PPO https://youtu.be/xvRrgxcpaHY

Issues:
*   The policy is optimized using 'noisy' (high-variance) data.
*   There is a high chance that an update 'pushes' the policy too far and changes it too drastically.
*   If the policy becomes bad (overshoots the optimum parameters), new data gathered with that policy will be of poor quality.
*   The whole learning loop may break down.

Solution:
*   Limit how much the policy can change from one iteration to the next.
*   Use a robust metric: distance in 'policy space', not in 'policy parameter space'.

# Extension of 'Vanilla' Policy Gradient (VPG)



Everything from notes [01_Simplest_Policy_Gradient_Implementations.ipynb](01_Simplest_Policy_Gradient_Implementations.ipynb) applies

We want to perform gradient ascent in the model's parameter space to maximize the policy's performance $J(\pi_{\theta_t})$.

_(remember that $J$ is the expected return over all possible trajectories)_

So the update rule is
$$\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} J(\pi_{\theta_t}) = \theta_t + \alpha \vec{g}$$
where the step direction is
$$\boxed{\vec{g} = \nabla_{\theta} J(\pi_{\theta}) =  \frac{1}{|D|} \sum_{\tau \in D}R(\tau)\cdot\sum_{t=0}^{T} \nabla_{\theta} \log \pi_\theta(a_t|s_t) }$$
Here $R(\tau)$ is the return (total reward) of a particular trajectory $\tau$, and $D$ is the set of sampled trajectories. Check the different variants in the linked notebook.
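The boxed estimator can be sketched in NumPy as below. This is only an illustration: the function name `vpg_gradient` and the toy arrays are made up for this example, and it assumes the per-trajectory sums $\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)$ have already been computed (e.g. by an autodiff framework).

```python
import numpy as np

def vpg_gradient(score_sums, returns):
    """Estimate g = (1/|D|) * sum_tau R(tau) * sum_t grad log pi(a_t|s_t).

    score_sums: (N, P) array, row i is the summed score vector of trajectory i
    returns:    (N,) array, R(tau_i) for each trajectory
    """
    score_sums = np.asarray(score_sums, dtype=float)
    returns = np.asarray(returns, dtype=float)
    # Weight each trajectory's summed score by its return, then average over D.
    return (returns[:, None] * score_sums).mean(axis=0)

# Toy data (hypothetical): 3 trajectories, 2 policy parameters.
score_sums = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])
returns = np.array([1.0, 0.5, 2.0])
g = vpg_gradient(score_sums, returns)
```

Trajectories with a larger return pull the estimate more strongly towards their own score direction.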

# Robust distance metric: KL divergence

KL divergence [KL_Divergence.ipynb](../../Statistics/KL_Divergence.ipynb) shares a few qualitative properties with Euclidean distance:
*   the distance between a policy and itself is 0
    $$D_{KL}(p||p) = \int p(x) \log \frac{p(x)}{p(x)} dx = \int p(x) \cdot (\log p(x)- \log p(x)) \ dx = \int 0 \ dx = 0$$
*   the distance between two policies is greater than or equal to 0 (Gibbs' inequality; the proof relies on Jensen's inequality)

Downsides:
* unlike geometric distance, it is not symmetric: in general $D_{KL}(p||q) \neq D_{KL}(q||p)$
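
The properties above can be checked numerically for discrete distributions. This is a small sketch; the helper `kl_categorical` and the two example distributions are chosen only for illustration.

```python
import numpy as np

def kl_categorical(p, q):
    """D_KL(p || q) for two discrete distributions over the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

kl_pp = kl_categorical(p, p)  # distance to itself
kl_pq = kl_categorical(p, q)  # forward direction
kl_qp = kl_categorical(q, p)  # reverse direction, differs from kl_pq
```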

Why use it in NPG?
*   Small changes to the policy (small steps) make policy learning more stable.

*   This 'smooth' trajectory through parameter space can be enforced by keeping the new policy 'similar enough' to the old policy.

***

# KL Divergence and Fisher information

KL divergence is related to Fisher Information (Matrix) [Fisher_Information.ipynb (ending)](../../Statistics/Fisher_Information.ipynb)
$$\boxed{D_{KL}\bigg(\pi(x; \theta) ||\pi(x; \theta + \delta) \bigg) \overset{\text{2nd order}}{\approx} \frac{1}{2}\delta^T \mathbb{E}\bigg[\nabla_\theta \log \pi(x; \theta) \ \nabla_\theta \log \pi(x; \theta)^T\bigg] \delta}$$
where the expectation is the Fisher information matrix $F$ (or $I(\theta)$)
$$F = \mathbb{E}\bigg[\nabla_\theta \log \pi(x; \theta) \ \nabla_\theta \log \pi(x; \theta)^T\bigg]$$
and the perturbation $\delta$ is the difference between the new and old policy parameters
$$\delta = \theta_{new} - \theta_{old}$$
> The relation between KL divergence and the Fisher Information Matrix (FIM) shows that<br>
> the FIM describes how 'sensitive' the distribution is to small deviations in its parametrization.
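
As a sanity check of the second-order relation, here is a sketch for a softmax (categorical) distribution whose parameters are the logits. For this family the score is $\nabla_\theta \log \pi(x;\theta) = e_x - p$. The function names and the numbers are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))  # shift for numerical stability
    return z / z.sum()

def fisher_softmax(theta):
    """F = E_x[score(x) score(x)^T] with score(x) = e_x - p."""
    p = softmax(theta)
    scores = np.eye(len(p)) - p  # row x holds e_x - p
    return np.einsum("x,xi,xj->ij", p, scores, scores)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

theta = np.array([0.5, -0.2, 0.1])
delta = 1e-2 * np.array([1.0, -2.0, 0.5])  # small perturbation

F = fisher_softmax(theta)
kl_exact = kl(softmax(theta), softmax(theta + delta))
kl_quad = 0.5 * delta @ F @ delta  # second-order approximation
```

For a small `delta` the two values should agree closely; the gap grows as the perturbation gets larger.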

__NOTE:__<br>
_Similar implementations stop at the intermediate step and write_
$$D_{KL}\bigg(\pi(x; \theta) ||\pi(x; \theta + \delta) \bigg) \overset{\text{2nd order}}{\approx} \frac{1}{2}\delta^T \nabla^2 D_{KL} \delta$$
_where $\nabla^2 D_{KL}$ is the Hessian of the KL divergence with respect to $\delta$, evaluated at $\delta = 0$. This requires computing the KL divergence itself. We will use this form in the TRPO method._

# (Optional) Why is curvature important? 

Second-order methods augment gradient descent with information about second derivatives (curvature).

Recall that in second-order optimization we use a Taylor expansion of the objective and search for its minimum $\vec{x}^*$.
$$f(\vec{x}) \approx f(\vec{x}_0) + \nabla f(\vec{x}_0) \cdot (\vec{x} - \vec{x}_0) + \frac{1}{2} (\vec{x} - \vec{x}_0)\cdot H(\vec{x}_0) (\vec{x} - \vec{x}_0)$$

$$\nabla f(\vec{x}^*) = \nabla f(\vec{x}_0) + H(\vec{x}_0) (\vec{x}^*- \vec{x}_0) = \vec{0}$$

$$\vec{x}^* = \vec{x}_0 - H(\vec{x}_0)^{-1}  \nabla f(\vec{x}_0)$$

Here the _Hessian matrix_ $H$ provides information about the curvature of the objective function.

_Intuition: curvature is the rate of change of the slope, or 'slope on steroids'. The Hessian is inverted because if curvature is low we are on a plateau and want to take larger steps (1/small = big). For high curvature we want the opposite effect._

It is an elegant way to adapt the step size, but it is computationally expensive.
([Notes_Second_Order_Methods.ipynb](../../optimization/Notes_Second_Order_Methods.ipynb))
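
A minimal sketch of the Newton step $\vec{x}^* = \vec{x}_0 - H^{-1}\nabla f(\vec{x}_0)$ on a quadratic test function $f(\vec{x}) = \frac{1}{2}\vec{x}^T A \vec{x} - \vec{b}^T\vec{x}$, where the second-order Taylor expansion is exact and one step lands on the minimum. The matrix `A`, vector `b` and starting point are arbitrary illustrative choices.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b  # gradient of 0.5 x^T A x - b^T x

def hess(x):
    return A  # Hessian is constant for a quadratic

x0 = np.array([5.0, -3.0])
# Solve H step = grad instead of forming the inverse explicitly.
x_star = x0 - np.linalg.solve(hess(x0), grad(x0))
```

Solving the linear system is usually preferred over computing $H^{-1}$ directly, but it still costs roughly cubic time in the number of parameters.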
***

# Optimization problem

The optimization goal for each iteration is to find a step $\delta^*$ that maximizes performance while keeping the new policy close to the old one:
$$\boxed{\delta^* = \underset{s.t. \ D_{KL}\big(\pi(\theta)||\pi(\theta+\delta)\big) \lt \epsilon}{\argmax_\delta} J(\theta + \delta)}$$
_(https://www.andrew.cmu.edu/course/10-703/slides/Lecture_NaturalPolicyGradientsTRPOPPO.pdf)_

We can soften the constraint by forming an _unconstrained_ objective function.

The constraint is violated if $$D_{KL}\big(\pi(\theta)||\pi(\theta+\delta)\big) > \epsilon$$ 
So the penalty function is
$$P_{KL}(\delta) = D_{KL}\big(\pi(\theta)||\pi(\theta+\delta)\big)- \epsilon$$
which is positive whenever the constraint is violated.

Unconstrained objective function:
$$f_U(\delta) = J(\theta + \delta) - \lambda P_{KL}(\delta)$$
With $\lambda > 0$, the term $- \lambda P_{KL}(\delta)$ lowers the objective when the constraint is violated.
Unconstrained optimization problem:
$$\delta^* = \argmax_\delta J(\theta + \delta) - \lambda P_{KL}(\delta)$$
$$\boxed{\delta^* = \argmax_\delta J(\theta + \delta) - \lambda \bigg(D_{KL}\big(\pi(\theta)||\pi(\theta+\delta)\big)- \epsilon\bigg)}$$
***

# Finding optimization direction

We approximate $J(\theta + \delta)$ to 1st order and the KL divergence to 2nd order via Taylor expansion and get

$$\delta^* = \argmax_\delta J(\theta) + \nabla_\theta J(\theta)\big|_{\theta = \theta_{old}}\delta - \lambda D_{KL}\big(\pi(\theta)||\pi(\theta+\delta)\big)+ \lambda \epsilon$$
$$ = \argmax_\delta  \nabla_\theta J(\theta)_{old}\delta - \frac{\lambda}{2}\delta^T F|_{\theta = \theta_{old}} \delta \underbrace{+ \lambda \epsilon + J(\theta)}_{\text{not important for optimization}}$$
Our maximization objective function is 
$$f = \nabla_\theta J(\theta)_{old}\delta - \frac{\lambda}{2}\delta^T F|_{\theta = \theta_{old}} \delta$$
_We can change maximization task into minimization by multiplying objective by -1._

Searching for the extremum, we compute the derivative with respect to $\delta$ and set it to zero:
$$\frac{\partial}{\partial \delta} \bigg(\nabla_\theta J(\theta)_{old}\delta - \frac{\lambda}{2}\delta^T F|_{\theta = \theta_{old}} \delta\bigg) = 0$$
Since $F$ is symmetric, $\frac{\partial}{\partial \delta}\big(\delta^T F \delta\big) = 2F\delta$, so the factor $\frac{1}{2}$ cancels and we get
$$0 = \nabla_\theta J(\theta)_{old} - \lambda F|_{\theta = \theta_{old}} \delta^*$$
$$\lambda F|_{\theta = \theta_{old}} \delta^* = \nabla_\theta J(\theta)_{old} $$
$$ \delta^* = \frac{1}{\lambda} F^{-1}\nabla_\theta J(\theta) $$
This is the original step direction, rescaled by the inverse Fisher matrix.

Modified gradient ascent parameter update:
$$\theta_{t+1} = \theta_t + \beta \delta^*$$
By absorbing the constant factors into $\alpha$, our iteration update is
$$\theta_{t+1} = \theta_t + \alpha  F^{-1}\nabla_\theta J(\theta) = \theta_t + \alpha  F^{-1}\vec{g}$$
We obtain an intermediate version of the NPG parameter update rule
$$\boxed{\theta_{t+1} = \theta_t+ \alpha  F^{-1}\vec{g}}$$
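
A sketch of how the direction $F^{-1}\vec{g}$ could be computed from samples. The Fisher matrix is estimated as the average outer product of per-sample score vectors, and the linear system is solved instead of inverting $F$. The function name `natural_direction`, the synthetic `scores`, and the small `damping` term (added here only for numerical stability, it is not part of the derivation above) are all assumptions of this example.

```python
import numpy as np

def natural_direction(scores, g, damping=1e-3):
    """Return (F^{-1} g, F) with F estimated from score samples.

    scores: (N, P) array, row i is grad log pi for sample i
    g:      (P,) vanilla policy gradient
    """
    scores = np.asarray(scores, dtype=float)
    F = scores.T @ scores / scores.shape[0]  # Monte Carlo estimate of E[s s^T]
    F_damped = F + damping * np.eye(F.shape[0])  # assumption: Tikhonov damping
    return np.linalg.solve(F_damped, g), F

rng = np.random.default_rng(0)
# Synthetic scores: the policy is very sensitive to parameter 0, barely to parameter 2.
scores = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])
g = np.array([1.0, 1.0, 1.0])

direction, F = natural_direction(scores, g)
```

Even though `g` has equal components, the natural direction takes a much larger step along the parameter the policy is least sensitive to, and a small one along the most sensitive parameter.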
***

# Finding step size via KL threshold value $\epsilon$

Yet we still have an unknown step size $\alpha$, and the KL divergence threshold $\epsilon$ has been discarded during optimization.

We can recover the step size via "normalization under the Fisher metric". The actual parameter step is $\alpha F^{-1}\vec{g}$, so the second-order KL approximation gives:
$$D_{KL}\bigg(\pi(x; \theta) ||\pi(x; \theta + \alpha F^{-1}\vec{g}) \bigg) \approx \frac{\alpha^2}{2}\bigg(F^{-1}\vec{g}\bigg)^T F \bigg(F^{-1}\vec{g}\bigg) \leq \epsilon$$
$$\frac{\alpha^2}{2}\bigg(F^{-1}\vec{g}\bigg)^T F \bigg(F^{-1}\vec{g}\bigg) = \frac{\alpha^2}{2} \vec{g}^T {F^{-1}}^T F F^{-1}\vec{g}\leq \epsilon$$
The Fisher information matrix is symmetric (it is an expectation of outer products and, like a Hessian, it is the matrix of second derivatives of the KL divergence): 
$${F^{-1}}^T F = {F^{-1}}^T F^T = (F F^{-1})^T = I^T = I$$
Almost done:
$$\frac{\alpha^2}{2} \vec{g}^T {F^{-1}}\vec{g}\leq \epsilon$$
$$\alpha^2 \leq \frac{2 \epsilon}{\vec{g}^T {F^{-1}}\vec{g}}$$
At the boundary ('$\leq$' $\longrightarrow$ '=' ), the largest allowed step size $\alpha$ is
$$\boxed{\alpha = \sqrt{\frac{2\epsilon}{\vec{g}^T {F^{-1}}\vec{g}}}}$$

# Final expression

Substituting the step size $\alpha$ into
$$\theta_{t+1} = \theta_t + \alpha  F^{-1}\vec{g}$$
gives the final update rule
$$\boxed{\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{\vec{g}^T {F^{-1}} \vec{g}}} F^{-1}\vec{g}}$$
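
A minimal sketch of the final update rule, assuming $F$ and $\vec{g}$ are already available and $F$ is positive definite. The function name `npg_update` and the toy numbers are illustrative only. The last line checks that the quadratic KL estimate of the resulting step equals $\epsilon$.

```python
import numpy as np

def npg_update(theta, g, F, epsilon=0.01):
    """One NPG step: theta + sqrt(2*eps / (g^T F^{-1} g)) * F^{-1} g."""
    nat_grad = np.linalg.solve(F, g)  # F^{-1} g without explicit inversion
    alpha = np.sqrt(2.0 * epsilon / (g @ nat_grad))  # step size from KL threshold
    return theta + alpha * nat_grad

# Toy values (hypothetical): 2 parameters, positive definite F.
F = np.array([[4.0, 1.0], [1.0, 0.5]])
g = np.array([1.0, -2.0])
theta = np.zeros(2)

theta_new = npg_update(theta, g, F, epsilon=0.01)
delta = theta_new - theta
kl_quad = 0.5 * delta @ F @ delta  # second-order KL of the step
```

Because the step is normalized by $\vec{g}^T F^{-1} \vec{g}$, rescaling the gradient does not change the update: only the direction of $F^{-1}\vec{g}$ and the threshold $\epsilon$ matter.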