
REINFORCE PolicyEstimator loss #15

Closed
IbrahimSobh opened this issue Oct 29, 2016 · 4 comments
IbrahimSobh commented Oct 29, 2016

Hi,

The code is very clear :) Thank you.

I have three comments/questions:

1- I have a question regarding the REINFORCE PolicyEstimator loss:

  • self.loss = -tf.log(self.picked_action_prob) * self.target

where self.target is the advantage = total_return - baseline_value

As you mentioned: Basically, we move our policy into a direction of more reward.

I think what we actually do is apply the Policy Gradient Theorem:

grad(J(theta)) = E[grad(log(pi(s, a))) * Q(s, a)]

where:
log(pi(s, a)) is tf.log(self.picked_action_prob)
and
Q(s, a) could be "total_return", or even better the "advantage"

(Correct?)
I wonder how and why this works. Could you please elaborate?
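For intuition, here is a minimal NumPy sketch (not the notebook's code) of a single REINFORCE update for a hypothetical linear-softmax policy. All names (theta, state, advantage) are made up for illustration; it only shows that gradient descent on -log(pi(a|s)) * advantage is the same as gradient ascent along the policy gradient above.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical linear-softmax policy: logits = theta @ state_features
theta = np.zeros((2, 4))                 # 2 actions, 4 state features
state = np.array([1.0, 0.5, -0.3, 0.2])
probs = softmax(theta @ state)

action = 1                               # the action actually taken in the episode
advantage = 2.5                          # total_return - baseline_value

# For a softmax policy: grad_theta log pi(a|s) = (one_hot(a) - probs) outer state
grad_log_pi = np.outer(np.eye(2)[action] - probs, state)

# Gradient ASCENT on J(theta) = E[log pi(a|s) * Q(s, a)], which is equivalent to
# gradient descent on the loss -log(pi(a|s)) * advantage used in the notebook.
learning_rate = 0.01
theta += learning_rate * advantage * grad_log_pi
```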

2- Total Return
Because it is Monte Carlo, we wait until the episode ends and then make updates. Regarding total_return:

  • total_return = sum(discount_factor**i * t.reward for i, t in enumerate(episode[t:]))

In English, what you do is add up the discounted actual rewards for each state in the episode, given the future states of that state, correct?

for example: the episode has a number of transitions (say 5), each transition has a reward, then the

empirical total_return of state_3 in transition_3 = reward_3 + discount_factor * reward_4 + discount_factor^2 * reward_5

And this applies even if a state is visited more than once during the episode. In this case the same state can be updated more than once? Please elaborate.
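To make this concrete, here is a small standalone sketch of the same per-time-step discounted return the line of code above computes (the rewards are made up):

```python
discount_factor = 0.99
rewards = [1.0, 0.0, 0.0, 1.0, 5.0]      # hypothetical rewards of a 5-step episode

# Discounted return from each time step t to the end of the episode
returns = [
    sum(discount_factor**i * r for i, r in enumerate(rewards[t:]))
    for t in range(len(rewards))
]

# For the third transition (index 2):
# returns[2] == rewards[2] + discount_factor * rewards[3] + discount_factor**2 * rewards[4]
```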

3- In your comment for "Using TD methods"

  • Q-Value TD Target (for off policy training)
  • SARSA TD Target (for on policy training)

I think you want to say that we can use Q-Learning OR SARSA in this example (not both), correct?
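For reference, a rough sketch (not the notebook's code) of how those two TD targets would differ; q_values here is a hypothetical estimator standing in for the notebook's value network:

```python
import numpy as np

def q_values(state):
    # Hypothetical action-value estimator; returns one value per action.
    return np.zeros(2)

def td_target(reward, next_state, next_action, discount_factor=0.99, on_policy=True):
    if on_policy:
        # SARSA target: bootstrap with the action that will actually be taken next.
        return reward + discount_factor * q_values(next_state)[next_action]
    # Q-Learning target: bootstrap with the greedy (max-value) action.
    return reward + discount_factor * np.max(q_values(next_state))
```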

I wonder if you will implement eligibility traces soon, or at least give a hint on how to implement them in the simplest way.

Thank you very much

@dennybritz (Owner) commented:

  1. Yes, your understanding seems right. If you want to know why this works in detail, I recommend watching the lecture or reading the book; they have good explanations. Intuitively, you update the policy function towards actions that give a better reward.

In English, what you do is add up the discounted actual rewards for each state in the episode, given the future states of that state, correct?

Yes.

And this applies even if a state is visited more than once during the episode. In this case the same state can be updated more than once?

Yes. I don't think there's a reason why a state shouldn't be updated more than once.

I think you want to say that we can use Q-Learning OR SARSA in this example (not both), correct?

I removed the comments from the notebook because they may be confusing. But yes, you can use either one, not both.

I wonder if you will implement eligibility traces soon, or at least give a hint on how to implement them in the simplest way.

I will try to add that, but implementing some of the missing algorithms like A3C is probably higher priority.
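In the meantime, for anyone looking for the "simplest way" hint: below is a minimal tabular TD(lambda) sketch with accumulating eligibility traces. It is not the repository's code, and all names (V, E, alpha, gamma, lam) are made up for illustration.

```python
import numpy as np

n_states = 10
V = np.zeros(n_states)               # tabular state-value estimates
E = np.zeros(n_states)               # eligibility traces, one per state
alpha, gamma, lam = 0.1, 0.99, 0.9   # step size, discount, trace decay

def td_lambda_update(state, reward, next_state, done):
    # One-step TD error for the transition just observed
    td_error = reward + (0.0 if done else gamma * V[next_state]) - V[state]
    # Accumulate the trace for the state just visited
    E[state] += 1.0
    # Every recently visited state shares in the update, weighted by its trace
    V[:] += alpha * td_error * E
    # Decay all traces; reset them when the episode ends
    if done:
        E[:] = 0.0
    else:
        E *= gamma * lam
```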

@IbrahimSobh (Author) commented:

Thank you

@IbrahimSobh (Author) commented:

As far as I understand, the main difference between Actor-Critic and A3C is that A3C uses multiple independent agents instead of a single agent.

In other words:

A3C = Actor-Critic + some tricks, such as asynchronous parameter updates from multiple agents.

Correct?
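To illustrate just the "asynchronous" part (this is not the repository's A3C implementation; the gradient is a random stand-in): several worker threads each compute gradients from their own copy of the environment and apply them to shared global parameters without waiting for each other.

```python
import threading
import numpy as np

global_theta = np.zeros(8)           # shared actor-critic parameters (hypothetical)

def worker(worker_id, n_updates=100, learning_rate=0.01):
    local_theta = global_theta.copy()                # each worker keeps its own copy
    for _ in range(n_updates):
        # Placeholder: run a few steps in this worker's own environment and
        # compute actor-critic gradients w.r.t. local_theta.
        grad = np.random.randn(*local_theta.shape)   # stand-in gradient
        global_theta[:] -= learning_rate * grad      # lock-free ("Hogwild"-style) update
        local_theta[:] = global_theta                # re-sync with the shared parameters

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```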

@dennybritz (Owner) commented:

Yes.
