
About Variance Reduction Method #20

Closed
daiquanyu opened this issue Mar 21, 2018 · 1 comment

Comments

@daiquanyu

In section 2.1 of the paper, the authors mention that the reward term is replaced with its advantage function. I have read the source code, but I still have some questions about the implementation of the variance reduction method.

  1. How is the variance reduction implemented in the code?
  2. If the baseline is set to a constant as I found (0.5 in the code?), how was this constant obtained?
  3. Since the parameters are updated during training, should the baseline be set to different values over time?
  4. What happens if we just use the plain policy gradient without a baseline?

Looking forward to your reply. Thanks.

@LantaoYu
Collaborator

Yes, in this implementation we simply use a constant baseline (0.5), which is approximately the average reward over all actions. Although the parameters are continuously updated, the expected reward remains close to 0.5, and empirically using this expected reward as the baseline performs well. The optimal baseline would actually be the expected reward weighted by the gradient magnitudes, but for simplicity we did not use that. In any case, it is well known that the naive policy gradient suffers from high variance, and subtracting a baseline (i.e., using an advantage function) is an effective way to reduce it.
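
For reference, here is a minimal sketch of what such a constant-baseline REINFORCE update looks like. This is illustrative Python/NumPy only, not the repository's actual code; the function and variable names are hypothetical:

```python
import numpy as np

def policy_gradient_loss(log_probs, rewards, baseline=0.5):
    """REINFORCE-style loss with a constant baseline (illustrative helper).

    log_probs : log pi(y_t | Y_{1:t-1}) for the sampled tokens, shape (T,)
    rewards   : discriminator-based rewards for those tokens, shape (T,)
    baseline  : constant subtracted from the reward; 0.5 approximates the
                average discriminator output, so (reward - baseline) acts
                as an advantage centered near zero.
    """
    advantages = rewards - baseline
    # Maximize E[log pi * advantage]; return the negative as a loss to minimize.
    return -np.sum(log_probs * advantages)

# Example: rewards near 0.5 give advantages near zero, which lowers the
# variance of the gradient estimate. A constant baseline does not bias the
# expected gradient, since E[grad log pi] = 0.
log_probs = np.log(np.array([0.2, 0.4, 0.3]))
rewards = np.array([0.6, 0.45, 0.55])
loss = policy_gradient_loss(log_probs, rewards)
```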
