Hi,
Furthermore, may I ask whether the gradients of the policy and value are backpropagated into the world model? If I understand your code correctly, they are not, as this line suggests. However, your paper mentions in several places that value and policy gradients are propagated back through the dynamics, so I am afraid I am misunderstanding the code. Please let me know if I made any mistakes.
Best,
Sherwin
The value loss is not propagated through multi-step predictions because we stop the gradients around the value targets as usual, turning it into a per-step loss.
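As a rough illustration, here is what that per-step loss looks like in a JAX-style sketch; `value_fn`, the parameter layout, and the argument names are my own placeholders, not the repository's actual code:

```python
import jax
import jax.numpy as jnp

def value_loss(value_fn, critic_params, states, lambda_returns):
    # Treat the multi-step lambda-return targets as constants, so the
    # loss never backpropagates through the predictions that built them.
    targets = jax.lax.stop_gradient(lambda_returns)
    values = value_fn(critic_params, states)
    # Plain per-step regression of the value function onto the targets.
    return 0.5 * jnp.mean((values - targets) ** 2)
```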
The actor loss is the negative of the lambda returns. This is backpropagated through imagined sequences of multiple states.
More precisely, the gradient flows from the predicted value of a future state through the neural-network value function, through the sequence of earlier states, through the sampled action, and into the actor.
The stop gradient you're pointing to makes sure it doesn't flow further. In other words, we only consider how the current action influences future states and their values, but not how the current action influences future actions.
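To make that gradient path concrete, here is a minimal JAX-style sketch of the imagined rollout, assuming hypothetical `actor_fn`, `dynamics_fn`, and `value_fn` callables and a `params` dictionary; it illustrates the idea rather than reproducing the repository's code:

```python
import jax
import jax.numpy as jnp

def actor_loss(actor_fn, dynamics_fn, value_fn, params, start_state,
               horizon, rng):
    state = start_state
    values = []
    for _ in range(horizon):
        rng, key = jax.random.split(rng)
        # Detaching the state here plays the role of the stop gradient
        # in question: gradients from future values still reach the
        # actor along
        #   value -> later states -> dynamics -> sampled action -> actor,
        # but they cannot continue from this action's policy input back
        # into earlier actions.
        action = actor_fn(params['actor'], jax.lax.stop_gradient(state), key)
        state = dynamics_fn(params['model'], state, action)
        values.append(value_fn(params['critic'], state))
    # Maximize predicted values by minimizing their negative; a faithful
    # version would use lambda-returns computed over these values.
    return -jnp.mean(jnp.stack(values))
```

Differentiating this loss with respect to `params['actor']` then yields exactly the path described above.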
Hope this helps. Regarding pcont, please reply in the previous ticket on that topic so that the discussion stays organized and easier for others to follow.
Thanks for your explanation. I see now that the gradient of the actor comes directly from the predicted values, which differs from traditional policy gradient methods. That makes sense now. By the way, I've moved my question about pcont to the previous issue.