In the case of policy gradients, we approximate a softmax policy from which actions are sampled stochastically according to their probabilities.
What about ES with a discrete action space? Does the method follow a greedy policy or a softmax policy? From the code, it looks like a greedy policy; is that the intended behavior?
It isn't necessary to use a stochastic policy during training for ES, as we don't need to take noisy actions to explore; the exploration is done solely through noise on the weights.
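For concreteness, here is a minimal NumPy sketch of that setup (not the repository's actual code): the per-step policy is a deterministic argmax over a linear score, and all exploration comes from Gaussian perturbations of the parameter vector. The `env` object with `reset()`/`step()` methods, the linear policy shape, and the hyperparameters are all hypothetical placeholders.

```python
import numpy as np

def greedy_action(theta, obs, n_actions):
    """Deterministic policy: argmax of a linear score.

    No action sampling happens here; the policy itself is greedy.
    """
    logits = theta.reshape(n_actions, -1) @ obs
    return int(np.argmax(logits))

def rollout(theta, env, n_actions):
    """Run one episode with the greedy policy; return the total reward.

    Assumes a toy env where step() returns (obs, reward, done).
    """
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(greedy_action(theta, obs, n_actions))
        total += reward
    return total

def es_step(theta, env, n_actions, pop_size=50, sigma=0.1, lr=0.01):
    """One ES update: exploration comes entirely from Gaussian noise
    applied to the weights, not from noisy actions."""
    eps = np.random.randn(pop_size, theta.size)  # parameter-space noise
    returns = np.array(
        [rollout(theta + sigma * e, env, n_actions) for e in eps]
    )
    # Standardize returns, then form the usual ES gradient estimate:
    # grad ~ (1 / (pop_size * sigma)) * sum_i R_i * eps_i
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    grad = (advantages @ eps) / (pop_size * sigma)
    return theta + lr * grad
```

Because fitness is evaluated over whole episodes run with perturbed parameters, there is no need for per-step action noise, so a greedy argmax at rollout time is the expected behavior.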