In the case of policy gradients, we approximate a softmax policy from which actions are sampled stochastically according to their probabilities.
What about ES with a discrete action space? Does the method follow a greedy policy or a softmax policy? From the code, it looks like a greedy policy; is that the intended behavior?
It isn't necessary to use a stochastic policy during training for ES, as we don't need to take noisy actions to explore; the exploration is done solely through noise on the weights.
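For concreteness, here is a minimal NumPy sketch of that setup (not the repository's actual code): the per-step policy is a deterministic argmax over a linear score, and all exploration comes from Gaussian perturbations of the parameter vector. The `env` object with `reset()`/`step()` methods, the linear policy shape, and the hyperparameters are all hypothetical placeholders.

```python
import numpy as np

def greedy_action(theta, obs, n_actions):
    """Deterministic policy: argmax of a linear score.

    No action sampling happens here; the policy itself is greedy.
    """
    logits = theta.reshape(n_actions, -1) @ obs
    return int(np.argmax(logits))

def rollout(theta, env, n_actions):
    """Run one episode with the greedy policy; return the total reward.

    Assumes a toy env where step() returns (obs, reward, done).
    """
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(greedy_action(theta, obs, n_actions))
        total += reward
    return total

def es_step(theta, env, n_actions, pop_size=50, sigma=0.1, lr=0.01):
    """One ES update: exploration comes entirely from Gaussian noise
    applied to the weights, not from noisy actions."""
    eps = np.random.randn(pop_size, theta.size)  # parameter-space noise
    returns = np.array(
        [rollout(theta + sigma * e, env, n_actions) for e in eps]
    )
    # Standardize returns, then form the usual ES gradient estimate:
    # grad ~ (1 / (pop_size * sigma)) * sum_i R_i * eps_i
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    grad = (advantages @ eps) / (pop_size * sigma)
    return theta + lr * grad
```

Because fitness is evaluated over whole episodes run with perturbed parameters, there is no need for per-step action noise, so a greedy argmax at rollout time is the expected behavior.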