
REINFORCE PolicyEstimator loss #15

Closed
IbrahimSobh opened this issue Oct 29, 2016 · 4 comments
IbrahimSobh commented Oct 29, 2016

Hi,

The code is very clear :) Thank you.

I have three comments/questions:

1- I have a question regarding the REINFORCE PolicyEstimator loss:

  • self.loss = -tf.log(self.picked_action_prob) * self.target

where self.target is the advantage = total_return - baseline_value

As you mentioned: Basically, we move our policy into a direction of more reward.

I think what we actually do is apply the Policy Gradient Theorem:

grad(J(theta)) = E[grad(log(pi(s, a))) * Q(s, a)]

where:
log(pi(s, a)) is tf.log(self.picked_action_prob)
and
Q(s, a) could be "total_return", or even better the "advantage"

(Correct?)
I wonder how and why this works. Could you please elaborate?
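For intuition, here is a minimal NumPy sketch (not the notebook's code) of a single REINFORCE update for a hypothetical linear-softmax policy. All names (theta, state, advantage) are made up for illustration; it only shows that gradient descent on -log(pi(a|s)) * advantage is the same as gradient ascent along the policy gradient above.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical linear-softmax policy: logits = theta @ state_features
theta = np.zeros((2, 4))                 # 2 actions, 4 state features
state = np.array([1.0, 0.5, -0.3, 0.2])
probs = softmax(theta @ state)

action = 1                               # the action actually taken in the episode
advantage = 2.5                          # total_return - baseline_value

# For a softmax policy: grad_theta log pi(a|s) = (one_hot(a) - probs) outer state
grad_log_pi = np.outer(np.eye(2)[action] - probs, state)

# Gradient ASCENT on J(theta) = E[log pi(a|s) * Q(s, a)], which is equivalent to
# gradient descent on the loss -log(pi(a|s)) * advantage used in the notebook.
learning_rate = 0.01
theta += learning_rate * advantage * grad_log_pi
```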

2- Total Return
Because it is Monte Carlo, we wait until the episode ends and then make updates. Regarding total_return:

  • total_return = sum(discount_factor**i * t.reward for i, t in enumerate(episode[t:]))

In English, what you do is add up the discounted actual rewards for each state in the episode, given the future states of that state, correct?

for example: the episode has a number of transitions (say 5), each transition has a reward, then the

empirical total_return of state_3 in transition_3 = reward_3 + discount_factor * reward_4 + discount_factor^2 * reward_5

And this applies even if a state is visited more than once during the episode. In this case the same state can be updated more than once? Please elaborate.
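To make this concrete, here is a small standalone sketch of the same per-time-step discounted return the line of code above computes (the rewards are made up):

```python
discount_factor = 0.99
rewards = [1.0, 0.0, 0.0, 1.0, 5.0]      # hypothetical rewards of a 5-step episode

# Discounted return from each time step t to the end of the episode
returns = [
    sum(discount_factor**i * r for i, r in enumerate(rewards[t:]))
    for t in range(len(rewards))
]

# For the third transition (index 2):
# returns[2] == rewards[2] + discount_factor * rewards[3] + discount_factor**2 * rewards[4]
```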

3- In your comment for "Using TD methods"

  • Q-Value TD Target (for off policy training)
  • SARSA TD Target (for on policy training)

I think you want to say that we can use Q-Learning OR SARSA in this example (not both), correct?
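For reference, a rough sketch (not the notebook's code) of how those two TD targets would differ; q_values here is a hypothetical estimator standing in for the notebook's value network:

```python
import numpy as np

def q_values(state):
    # Hypothetical action-value estimator; returns one value per action.
    return np.zeros(2)

def td_target(reward, next_state, next_action, discount_factor=0.99, on_policy=True):
    if on_policy:
        # SARSA target: bootstrap with the action that will actually be taken next.
        return reward + discount_factor * q_values(next_state)[next_action]
    # Q-Learning target: bootstrap with the greedy (max-value) action.
    return reward + discount_factor * np.max(q_values(next_state))
```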

I wonder if you will implement eligibility traces soon, or at least give a hint on how to implement them in the simplest way.

Thank you very much

@dennybritz (Owner) commented:

  1. Yes, your understanding seems right. If you want to know why this works in detail, I recommend watching the lecture or reading the book; they have good explanations. Intuitively, you update the policy function towards actions that give a better reward.

In English, what you do is add up the discounted actual rewards for each state in the episode, given the future states of that state, correct?

Yes.

And this applies even if a state is visited more than once during the episode. In this case the same state can be updated more than once?

Yes. I don't think there's a reason why a state shouldn't be updated more than once.

I think you want to say that we can use Q-Learning OR SARSA in this example (not both), correct?

I removed the comments from the notebook because they may be confusing. But yes, you can use either one, not both.

I wonder if you will implement eligibility traces soon, or at least give a hint on how to implement them in the simplest way.

I will try to add that, but implementing some of the missing algorithms like A3C is probably higher priority.
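In the meantime, for anyone looking for the "simplest way" hint: below is a minimal tabular TD(lambda) sketch with accumulating eligibility traces. It is not the repository's code, and all names (V, E, alpha, gamma, lam) are made up for illustration.

```python
import numpy as np

n_states = 10
V = np.zeros(n_states)               # tabular state-value estimates
E = np.zeros(n_states)               # eligibility traces, one per state
alpha, gamma, lam = 0.1, 0.99, 0.9   # step size, discount, trace decay

def td_lambda_update(state, reward, next_state, done):
    # One-step TD error for the transition just observed
    td_error = reward + (0.0 if done else gamma * V[next_state]) - V[state]
    # Accumulate the trace for the state just visited
    E[state] += 1.0
    # Every recently visited state shares in the update, weighted by its trace
    V[:] += alpha * td_error * E
    # Decay all traces; reset them when the episode ends
    if done:
        E[:] = 0.0
    else:
        E *= gamma * lam
```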

@IbrahimSobh (Author) commented:

Thank you

@IbrahimSobh (Author) commented:

As far as I understand, the main difference between Actor-Critic and A3C is that A3C uses multiple independent agents instead of a single agent.

In other words:

A3C = Actor-Critic + some tricks, such as asynchronous parameter updates from multiple agents.

Correct?
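To illustrate just the "asynchronous" part (this is not the repository's A3C implementation; the gradient is a random stand-in): several worker threads each compute gradients from their own copy of the environment and apply them to shared global parameters without waiting for each other.

```python
import threading
import numpy as np

global_theta = np.zeros(8)           # shared actor-critic parameters (hypothetical)

def worker(worker_id, n_updates=100, learning_rate=0.01):
    local_theta = global_theta.copy()                # each worker keeps its own copy
    for _ in range(n_updates):
        # Placeholder: run a few steps in this worker's own environment and
        # compute actor-critic gradients w.r.t. local_theta.
        grad = np.random.randn(*local_theta.shape)   # stand-in gradient
        global_theta[:] -= learning_rate * grad      # lock-free ("Hogwild"-style) update
        local_theta[:] = global_theta                # re-sync with the shared parameters

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```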

@dennybritz (Owner) commented:

Yes.
