In section 2.1 of the paper, the authors mention that the reward term is replaced with its advantage function. I have read the source code, but I still have some questions about the implementation of the variance reduction method.
How is it implemented in the code?
If the baseline function is set to a constant, as I found (0.5 in the code?), how is this constant obtained?
Since the parameters are updated during training, should the baseline function be set to different values?
What if we just use the simple policy gradient?
Looking forward to your reply. Thanks.
Yes, in this implementation we simply use a constant baseline (0.5), which is approximately the average reward over all actions. Although the parameters are continuously updated, the expectation of the rewards remains close to 0.5, and empirically using the expected reward as the baseline performs well. The optimal constant baseline would actually be the expected reward weighted by the squared gradient magnitudes, but for simplicity we did not use that. As for the last question: it is well known that the naive policy gradient suffers from high variance, and replacing the reward with its advantage (reward minus baseline) is an effective way to reduce that variance.
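For anyone reading this later, here is a minimal sketch of the idea in PyTorch. The names (`policy_gradient_loss`, `log_probs`, `rewards`) are hypothetical and do not come from this repository; the snippet only illustrates how subtracting a constant baseline turns the reward into an advantage in a REINFORCE-style loss.

```python
import torch

# Assumption: rewards lie in [0, 1], so 0.5 is roughly the mean reward.
BASELINE = 0.5

def policy_gradient_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style loss with a constant baseline: advantage = reward - baseline."""
    advantages = rewards - BASELINE                    # variance-reduction step
    # Detach so gradients flow only through the log-probabilities of the sampled actions.
    return -(advantages.detach() * log_probs).mean()

# Illustrative usage (shapes and helper names are hypothetical):
#   log_probs = dist.log_prob(actions)        # [batch]
#   rewards   = compute_reward(actions)       # [batch], values in [0, 1]
#   loss = policy_gradient_loss(log_probs, rewards)
#   loss.backward(); optimizer.step()
```

Any baseline that does not depend on the sampled action leaves the policy-gradient estimator unbiased, which is why a fixed constant such as 0.5 (or a running average of past rewards) is safe here; it only changes the variance, not the expected gradient.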