# Notes

> Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks.

Policy gradient methods are attractive for reinforcement learning, but they usually need large numbers of samples, and stable, steady improvement is hard to obtain because the incoming data is non-stationary.

They use value functions to reduce the variance of the policy gradient estimates (at the cost of some bias), and they use trust region optimization for both the policy and the value function to cope with the non-stationarity of the data.

> Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground.

The method is highly effective on simulated 3D locomotion tasks.

> A key source of difficulty is the long time delay between actions and their positive or negative effect on rewards; this issue is called the credit assignment problem.

> Value functions offer an elegant solution to the credit assignment problem—they allow us to estimate the goodness of an action before the delayed reward arrives.

Value functions are a good solution to the delayed reward problem: they estimate the long-term effect of an action right away, without having to propagate reward back through thousands of time steps.

> Unfortunately, the variance of the gradient estimator scales unfavorably with the time horizon, since the effect of an action is confounded with the effects of past and future actions.

> We propose a family of policy gradient estimators that significantly reduce variance while maintaining a tolerable level of bias.

They propose a family of estimators that trades bias against variance: the variance of the gradient estimate is cut substantially, and the bias this introduces is kept small enough that learning still works.

> We present experimental results on a number of highly challenging 3D locomotion tasks, where we show that our approach can learn complex gaits using high-dimensional, general purpose neural network function approximators for both the policy and the value function.

They propose a method for variance reduction in policy gradient methods called generalized advantage estimation, and a trust region optimization method for the value function.

> We will introduce a parameter $\gamma$ that allows us to reduce variance by down-weighting rewards corresponding to delayed effects, at the cost of introducing bias.
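
A minimal sketch of what down-weighting delayed rewards means in practice. The function name `discounted_returns` and the toy reward sequence are illustrative assumptions, not taken from the paper.

```python
def discounted_returns(rewards, gamma):
    """Return sum_{l>=0} gamma^l * r_{t+l} for every step t of a finite trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy trajectory (assumed): a single reward arrives at the last step.
rewards = [0.0, 0.0, 0.0, 1.0]
undiscounted = discounted_returns(rewards, gamma=1.0)  # every step sees the full reward
discounted = discounted_returns(rewards, gamma=0.5)    # earlier steps see less of it
```

With $\gamma = 1$ every action in the trajectory is credited with the full final reward. With $\gamma = 0.5$ the credit halves for each step of delay, which lowers variance but biases the estimate toward short-term effects.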

### Advantage Function Estimation

They approximate the policy gradient using discounted returns: each log-probability gradient is weighted by an estimate of the advantage defined with respect to the discounted ($\gamma < 1$) value function.

The true advantage function is unknown, so they estimate it from sampled trajectories and a learned value function.
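
A minimal sketch of the generalized advantage estimator, assuming a finite trajectory and a list of value predictions with one extra bootstrap entry at the end. It accumulates the TD residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ with weights $(\gamma\lambda)^l$. The function name, argument layout, and toy numbers are assumptions made for illustration.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimates for one finite trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T); the last entry is the bootstrap value
             (use 0.0 if s_T is terminal).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # TD residual: a one-step estimate of the advantage at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy inputs (assumed values, chosen so the arithmetic is exact).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]
one_step = gae_advantages(rewards, values, gamma=0.5, lam=0.0)
monte_carlo = gae_advantages(rewards, values, gamma=0.5, lam=1.0)
```

Setting `lam=0.0` reduces the estimate to the one-step TD residual (low variance, biased if the value function is wrong). Setting `lam=1.0` gives the discounted return minus $V(s_t)$ (higher variance, no bias from value-function error). Intermediate values interpolate between the two.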

$$
\hat{g} = \frac{1}{N} \sum_{n=1}^N \sum_{t=0}^\infty \hat{A}_t^n \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n)
$$
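
A small sketch of how this estimator could be computed, under strong simplifying assumptions: the policy is a state-independent softmax over two actions, trajectories are finite, and the advantage estimates are supplied by hand. All names (`softmax`, `grad_log_prob`, `policy_gradient_estimate`) are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: onehot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient_estimate(logits, trajectories):
    """Average over trajectories of sum_t A_t * grad log pi(a_t).

    trajectories: list of trajectories, each a list of
                  (action, advantage_estimate) pairs.
    """
    grad = [0.0] * len(logits)
    for trajectory in trajectories:
        for action, advantage in trajectory:
            score = grad_log_prob(logits, action)
            for i in range(len(grad)):
                grad[i] += advantage * score[i]
    n = len(trajectories)
    return [g / n for g in grad]


# Toy data (assumed): N = 2 trajectories with hand-picked advantages.
logits = [0.0, 0.0]
trajectories = [
    [(0, 1.0), (1, -1.0)],
    [(0, 2.0)],
]
g_hat = policy_gradient_estimate(logits, trajectories)
```

Action 0 was taken when the advantage was positive and action 1 when it was negative, so the estimate pushes the logit of action 0 up and that of action 1 down.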

> One can interpret λ as an extra discount factor applied after performing a reward shaping transformation on the MDP.
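
To make the reward shaping reading concrete, here is a sketch assuming the value function $V$ is used as the shaping potential. The symbols $\tilde{r}$ and $\delta_t^V$ are notation introduced here for illustration. The shaped reward is

$$
\tilde{r}(s_t, a_t, s_{t+1}) = r_t + \gamma V(s_{t+1}) - V(s_t) = \delta_t^V
$$

and the generalized advantage estimate is the sum of shaped rewards discounted by $\gamma\lambda$:

$$
\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^\infty (\gamma\lambda)^l \, \tilde{r}(s_{t+l}, a_{t+l}, s_{t+l+1}) = \sum_{l=0}^\infty (\gamma\lambda)^l \, \delta_{t+l}^V
$$

So $\gamma$ discounts the original rewards, while $\lambda$ adds a second, steeper discount on the shaped ones.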

They fit an approximate value function by regressing a neural network onto the empirical discounted returns, and use it to compute the advantage estimates.
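
A sketch of the trust region formulation of the value function fit, as best recalled from the paper. The symbols $\phi$ (value function parameters), $\hat{V}_n$ (discounted return target for state $s_n$), $\sigma^2$, and $\epsilon$ do not appear elsewhere in these notes and should be checked against the original.

$$
\begin{aligned}
\underset{\phi}{\text{minimize}} \quad & \sum_{n=1}^N \lVert V_\phi(s_n) - \hat{V}_n \rVert^2 \\
\text{subject to} \quad & \frac{1}{N} \sum_{n=1}^N \frac{\lVert V_\phi(s_n) - V_{\phi_{\text{old}}}(s_n) \rVert^2}{2\sigma^2} \le \epsilon
\end{aligned}
$$

Here $\sigma^2 = \frac{1}{N} \sum_{n=1}^N \lVert V_{\phi_{\text{old}}}(s_n) - \hat{V}_n \rVert^2$ is computed before the update. The constraint limits how far the new predictions may move from the old ones on the current batch, which guards against overfitting to the most recent data.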

### Experiments

They use the following algorithm, which combines generalized advantage estimation with TRPO for the trust region policy update step.

![Screenshot 2024-12-05 at 5.54.28 PM.png](../../../images/Screenshot_2024-12-05_at_5.54.28_PM.png)

They use MuJoCo to simulate biped and quadruped locomotion.

![Screenshot 2024-12-05 at 5.56.51 PM.png](../../../images/Screenshot_2024-12-05_at_5.56.51_PM.png)

> The result after 1000 iterations is a fast, smooth, and stable gait that is effectively completely stable.

![Screenshot 2024-12-05 at 5.56.58 PM.png](../../../images/Screenshot_2024-12-05_at_5.56.58_PM.png)

These graphs show that without a value function, the policy takes a long time to improve and never comes close to the best-case performance. With well-chosen $\lambda$ and $\gamma$, the policy converges much faster.

### Discussion

> Policy gradient methods provide a way to reduce reinforcement learning to stochastic gradient descent, by providing unbiased gradient estimates.

Policy gradient methods are effective, but they have struggled with robotic control because of their high sample complexity.

> We have argued that the key to variance reduction is to obtain good estimates of the advantage function.

> Our main experimental validation of generalized advantage estimation is in the domain of simulated robotic locomotion.
