
Are policy and value gradients propagated back through the world model? #3

Closed
xlnwel opened this issue Feb 26, 2020 · 2 comments

Comments


xlnwel commented Feb 26, 2020

Hi,

May I ask whether the policy and value gradients are backpropagated into the world model? If I understand your code correctly, they are not, as this line suggests. However, your paper mentions in several places that value and policy gradients are propagated back through the dynamics, so I am afraid I am misreading the code. Please let me know if I have made any mistakes.

Best,

Sherwin

danijar (Owner) commented Feb 27, 2020

The value loss is not propagated through multi-step predictions because we stop the gradients around the value targets as usual, turning it into a per-step loss.
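
As a minimal sketch of what this means (hypothetical names like `value_net`, `imagined_states`, and `lambda_returns`; this is not the actual code in the repository), the value objective reduces to a per-step regression once the targets are wrapped in a stop-gradient:

```python
import tensorflow as tf

# Illustrative per-step value loss with stopped-gradient targets
# (hypothetical names, simplified from any real implementation).
def value_loss(value_net, imagined_states, lambda_returns):
    # Stopping the gradient around the targets turns the loss into a
    # simple per-step regression toward fixed lambda-return targets,
    # rather than backpropagating through the multi-step return.
    targets = tf.stop_gradient(lambda_returns)
    values = value_net(imagined_states)  # predicted value for each imagined step
    return 0.5 * tf.reduce_mean(tf.square(values - targets))
```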

The actor loss is the negative of the lambda returns. This is backpropagated through imagined sequences of multiple states.

More precisely, the gradient flows from the predicted value of a future state through the neural-network value function, through the sequence of earlier states, and through the sampled action into the actor.

The stop gradient you're pointing to makes sure it doesn't flow further. In other words, we only consider how a current action influences future states and their values, but not how a current action influences future actions.
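
To make that gradient path concrete, here is a heavily simplified sketch (hypothetical names like `actor`, `dynamics`, and `value_net`; the real objective uses lambda returns rather than a plain sum of predicted values):

```python
import tensorflow as tf

# Illustrative actor objective over an imagined rollout (not the repository's code).
def actor_loss(actor, dynamics, value_net, state, horizon):
    returns = 0.0
    for _ in range(horizon):
        # Stopping the gradient on the actor's input means a current action is
        # credited for how it changes future states and their values, but not
        # for how it changes the actions chosen at those future states.
        action = actor(tf.stop_gradient(state)).sample()  # reparameterized sample
        state = dynamics(state, action)  # imagined next state, differentiable w.r.t. the action
        returns += value_net(state)      # predicted value of the imagined future state
    return -tf.reduce_mean(returns)      # maximize predicted values along the rollout
```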

Hope this helps. Regarding pcont, please reply to the previous ticket on that topic so we keep the discussion organized and easier for others to follow.

danijar closed this as completed Feb 27, 2020
xlnwel (Author) commented Feb 28, 2020

Hi,

Thanks for your explanation. I see now that the actor's gradient comes directly from the predicted values, unlike the traditional policy gradient method, which makes sense. By the way, I've moved my question about pcont to the previous issue.
