REINFORCE PolicyEstimator loss #15
Comments
Yes.
Yes. I don't think there's a reason why a state shouldn't be updated more than once.
I removed the comments from the notebook because they may be confusing. But yes, you can use either one, not both.
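Roughly, the two alternative TD targets would look like this (an illustrative sketch only, with placeholder names, not the exact notebook code):

```python
import numpy as np

# Illustrative sketch of the two alternative TD targets (use one OR the other, not both).
# Q is assumed to be a mapping state -> np.array of action values; all names are placeholders.

def q_learning_target(Q, reward, next_state, discount_factor):
    # Off-policy: bootstrap from the greedy action in the next state
    return reward + discount_factor * np.max(Q[next_state])

def sarsa_target(Q, reward, next_state, next_action, discount_factor):
    # On-policy: bootstrap from the action that was actually taken in the next state
    return reward + discount_factor * Q[next_state][next_action]
```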
I will try to add that, but implementing some of the missing algorithms like A3C is probably a higher priority.
Thank you.
As far as I understand, the main difference between Actor Critic and A3C is that A3C uses multiple independent agents instead of one agent. In other words: A3C = Actor Critic + some tricks such as asynchronous parameter updates from multiple agents. Correct?
Yes.
Dear author,
The code is very clear :) thank you.
I have three comments/questions:
1- I have a question regarding the REINFORCE PolicyEstimator loss,
where self.target is the advantage = total_return - baseline_value.
As you mentioned: Basically, we move our policy into a direction of more reward.
I think what we actually do is apply the Policy Gradient Theorem:
grad(J(theta)) = E[grad(log(pi(s, a))) * Q(s, a)]
where:
log(pi(s, a)) is tf.log(self.picked_action_prob)
and
Q(s, a) could be "total_return" or, even better, the "advantage".
(Correct?)
I wonder how and why this works; could you please elaborate?
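To make my question concrete, here is a small standalone illustration I wrote (my own sketch with made-up numbers, not the notebook's code) of why minimizing -log(pi(a|s)) * advantage moves probability toward actions with positive advantage:

```python
import numpy as np

# Toy softmax policy over 3 actions for a single state, parameterized by theta.
theta = np.zeros(3)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

action, advantage, lr = 1, 2.0, 0.1  # suppose action 1 had a positive advantage

for step in range(50):
    probs = softmax(theta)
    # Gradient of log pi(a|s) w.r.t. theta for a softmax policy: one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    # Gradient ascent on log pi(a|s) * advantage, i.e. descent on the loss above
    theta += lr * advantage * grad_log_pi

print(softmax(theta))  # the probability of action 1 has grown
```

Running this, the probability of the action with positive advantage rises toward 1, which matches my reading of "we move our policy into a direction of more reward".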
2- Total Return
Because it is Monte Carlo, we wait until the episode ends and then make the updates. Regarding total_return:
In plain English, what you do is add up the discounted actual rewards for each state in the episode, over the future steps from that state onward, correct?
For example: the episode has a number of transitions (say 5) and each transition has a reward; then the
empirical total_return of state_3 in transition_3 = reward_3 + discount_factor * reward_4 + discount_factor^2 * reward_5.
And this applies even if a state is visited more than once during the episode; in that case, can the same state be updated more than once? Please elaborate.
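To check my understanding, here is a small sketch of how I think total_return is computed per step (my own code with made-up rewards, not the notebook's):

```python
# Discounted return from step t onward: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
def total_return(rewards, t, gamma):
    return sum(gamma ** i * r for i, r in enumerate(rewards[t:]))

discount_factor = 0.99
rewards = [1.0, 0.0, 2.0, 1.0, 3.0]  # reward_1 ... reward_5 of a 5-transition episode

# Return for the state visited at transition 3 (index 2):
print(total_return(rewards, 2, discount_factor))  # = reward_3 + gamma*reward_4 + gamma^2*reward_5
```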
3- In your comment for "Using TD methods"
I think you mean that we can use Q-Learning OR SARSA in this example (not both), correct?
I wonder if you will implement eligibility traces soon, or at least give a hint on how to implement them in the simplest way.
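For reference, here is my rough guess at the simplest version, a tabular SARSA(lambda) episode with accumulating traces (everything here, including env, policy and the names, is a placeholder of my own, not your code). Is this roughly the right idea?

```python
import numpy as np
from collections import defaultdict

def sarsa_lambda_episode(env, Q, policy, num_actions,
                         discount_factor=0.99, trace_decay=0.9, alpha=0.1):
    # Eligibility traces: one value per (state, action) pair, starting at zero
    E = defaultdict(lambda: np.zeros(num_actions))
    state = env.reset()
    action = policy(state)
    done = False
    while not done:
        next_state, reward, done, _ = env.step(action)
        next_action = policy(next_state)
        # TD error with the SARSA (on-policy) target; zero bootstrap at episode end
        td_error = (reward
                    + discount_factor * Q[next_state][next_action] * (not done)
                    - Q[state][action])
        E[state][action] += 1.0  # accumulate the trace for the visited pair
        # Every (s, a) is updated in proportion to its trace, then all traces decay
        for s in E:
            Q[s] += alpha * td_error * E[s]
            E[s] *= discount_factor * trace_decay
        state, action = next_state, next_action
    return Q
```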
Thank you very much