[Question] Regarding training the reward head #58

Closed
EdanToledo opened this issue May 13, 2023 · 1 comment

EdanToledo commented May 13, 2023

Hello,

I had a quick question about the reward head that I was hoping you could help clarify. In the diagrams in the paper, the reward for a transition appears to be predicted from the successive state, i.e. for the transition (s_1, a_1, r_1, s_2) you would predict r_1 from the posterior state encoded from s_2. That makes sense to me, since the hidden state has to process the action before the reward can be predicted. In the code, however, the reward seems to be shifted back by one step: the sequence of observations is processed into a sequence of posteriors, and the reward head is trained on the reward sequence using those posteriors, where the reward sequence starts at r_1 but the first posterior in the sequence is z_1. So I was wondering whether you pad the reward sequence at the beginning, or whether I am misunderstanding how the world model works. I've attached an image to illustrate my confusion:

[Attached image: dreamer]

In my image, the _init variables are just the initial masked variables and/or the learned starting state. I assume the prev_action in this case is just zeros.
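To make the two pairings I have in mind concrete, here is a rough symbolic sketch (placeholder names only, not actual code from this repository):

```python
# Two possible pairings of rewards with posteriors for a length-4 sequence.
# Purely illustrative; the posterior z_t is the one computed from s_t.
obs = ["s_1", "s_2", "s_3", "s_4"]
rewards = ["r_1", "r_2", "r_3", "r_4"]

# Alignment I read from the paper diagrams: r_t is predicted from the posterior of s_{t+1}.
paper_pairs = list(zip(obs[1:], rewards[:-1]))   # (s_2, r_1), (s_3, r_2), (s_4, r_3)

# Alignment I read from the code: the posterior of s_t is trained against r_t directly.
code_pairs = list(zip(obs, rewards))             # (s_1, r_1), (s_2, r_2), ...

print(paper_pairs)
print(code_pairs)
```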

Thank you.

danijar commented May 14, 2023

Hi, good question. The time step alignment used in this repository is the one shown in the first line, not the second (it's also the convention the newer Sutton & Barto book recommends):

r1=0, s1 -> a1 -> r2, s2 -> a2 -> ...
s1 -> a1 -> r1, s2 -> a2 -> ...

In other words, the reward at some index is the consequence of the action one index earlier, not of the action at the same index.
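A minimal sketch of a rollout loop that stores data under this alignment (env_reset, env_step, and policy are generic placeholders, not this repository's API):

```python
def collect(env_reset, env_step, policy, num_steps):
    """Toy rollout storing data so that the reward at index t is the result
    of the action taken at index t - 1; the first step carries reward 0."""
    rewards, observations, actions = [], [], []
    obs = env_reset()                 # s_1; no action has been taken yet
    reward = 0.0                      # r_1 = 0 by convention
    for _ in range(num_steps):
        rewards.append(reward)        # r_t
        observations.append(obs)      # s_t
        act = policy(obs)             # a_t
        actions.append(act)
        obs, reward = env_step(act)   # environment returns s_{t+1}, r_{t+1}
    # The reward head is trained so that the posterior at index t predicts
    # rewards[t], i.e. the reward produced by the action at index t - 1.
    return observations, actions, rewards
```

Under this convention the reward sequence already starts at 0 alongside the first posterior, so no padding of the reward sequence is needed.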

danijar closed this as completed May 14, 2023