Transition Policy Gradients #10

Open
gyh75520 opened this issue Sep 25, 2018 · 1 comment
@gyh75520

From the paper:
"... in fact the proper form for the transition policy gradient arrived at in eqn. 10."

From the code:
manager_loss = -tf.reduce_sum((self.r-cutoff_vf_manager)*dcos)

Why not implement eqn. 10 directly?
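
For context, here is roughly what that quoted line computes. This is only a minimal sketch with assumed tensor names (`s_diff`, `goals`, `returns`, `manager_value` are placeholders, not the repository's exact variables):

```python
import tensorflow as tf

def manager_loss(s_diff, goals, returns, manager_value):
    """Sketch of the eqn.-7-style manager objective: advantage-weighted cosine
    similarity between the realised state change s_{t+c} - s_t and the goal g_t.
    Tensor names are illustrative placeholders."""
    # d_cos(s_{t+c} - s_t, g_t): cosine similarity per time step
    dcos = tf.reduce_sum(
        tf.nn.l2_normalize(s_diff, axis=-1) * tf.nn.l2_normalize(goals, axis=-1),
        axis=-1)
    # manager advantage A^M_t = R_t - V^M(x_t); the value baseline is cut off
    # from the gradient (presumably what "cutoff_vf_manager" refers to above)
    advantage = returns - tf.stop_gradient(manager_value)
    # minimise the negative, i.e. ascend the advantage-weighted cosine similarity
    return -tf.reduce_sum(advantage * dcos)
```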

@biggzlar

biggzlar commented Jan 30, 2019

Because the simpler, heuristic form in eqn. 7 is in fact the proper form of the more complex and (probably) less robust eqn. 10. Eqn. 10 is the gradient of a policy over states, whereas eqn. 7 works with directions in state space.
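
For reference, the two forms as I remember them from the paper (written from memory, so the notation may differ slightly from the published version):

```latex
% Eqn. 7 -- heuristic manager update actually used: advantage-weighted cosine
% similarity between the realised state change and the goal direction.
\[
  \nabla g_t \;=\; A^{M}_{t}\,\nabla_{\theta}\, d_{\cos}\!\bigl(s_{t+c}-s_{t},\, g_{t}(\theta)\bigr),
  \qquad A^{M}_{t} \;=\; R_t - V^{M}_{t}(x_t,\theta).
\]

% Eqn. 10 -- transition policy gradient: the gradient of a policy over end
% states s_{t+c}, rather than over directions in state space.
\[
  \nabla_{\theta}\,\pi^{\mathrm{TP}}_{t} \;=\;
  \mathbb{E}\bigl[(R_t - V(x_t))\,\nabla_{\theta}\log p(s_{t+c}\mid s_t,\theta)\bigr].
\]
```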

Here's an intuition: we tell the agent to find a real-world address. Eqn. 10 suggests intermediate addresses to help the agent find the final one, and the agent is rewarded every time it reaches one of the suggested addresses. Eqn. 7 instead suggests directions towards intermediate addresses, and the agent is rewarded as soon as it moves in that direction (so if the agent acts well, it is rewarded all the time, rather than sparsely).
