I am a little confused regarding how the indices work for the replay buffer. Specifically, a "dummy" transition is repeatedly referenced in `replay_buffer.py`, and there are some +/- 1 adjustments made to the indices. From reading the code, it seems like the storage layout keeps one array per field (observation, action, reward, discount), with a dummy entry at index 0, so that a transition is reassembled from adjacent indices. Is that the correct interpretation of the memory layout? And if so, why is the offset used?

Thanks for open-sourcing this!!
Yes, this is the correct interpretation of the replay buffer layout.
The reason I did this is to avoid storing `next_observation` and save memory: only arrays for observations, actions, rewards, and discounts are kept.
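Concretely, each stored episode can be pictured as four parallel arrays of length T + 1, with the dummy transition at index 0. A minimal sketch, assuming illustrative shapes (84x84x3 pixel observations, 6-dim actions) rather than the repository's actual code:

```python
import numpy as np

T = 500  # e.g. a 1000-step DMC episode with action_repeat=2, as described below

# Index 0 is the dummy transition: it carries obs_0 together with
# placeholder action/reward/discount values that are never trained on.
observations = np.empty((T + 1, 84, 84, 3), dtype=np.uint8)  # obs_0 ... obs_T
actions = np.empty((T + 1, 6), dtype=np.float32)             # dummy, a_1 ... a_T
rewards = np.empty((T + 1,), dtype=np.float32)               # dummy, r_1 ... r_T
discounts = np.empty((T + 1,), dtype=np.float32)             # dummy, d_1 ... d_T

# next_observation for step i is observations[i]; the observation the
# action was taken from is observations[i - 1]. No separate array is needed.
```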
For example, a typical episode in DMC is 1000 steps, which, with action_repeat=2, equates to 500 steps. So if we execute an episode like this
```python
time_step = env.reset()  # this corresponds to (obs_0, None, None)
while not time_step.last():
    action = agent.act(time_step.observation)
    time_step = env.step(action)  # this corresponds to (obs_i, action_i, reward_i)
```
it will produce 501 time steps that will be stored in the replay buffer. Now, to sample a transition from the replay buffer, we first sample a random starting position from the [1, 500] interval.
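A minimal sketch of that sampling step, using the same illustrative arrays as above (again an assumption, not the repository's exact code):

```python
import numpy as np

T = 500
observations = np.empty((T + 1, 84, 84, 3), dtype=np.uint8)  # as sketched above
actions = np.empty((T + 1, 6), dtype=np.float32)
rewards = np.empty((T + 1,), dtype=np.float32)
discounts = np.empty((T + 1,), dtype=np.float32)

# Draw idx uniformly from [1, T]; the +1 skips the dummy slot at index 0,
# which is only ever read back as a "previous" observation.
idx = np.random.randint(0, T) + 1

obs = observations[idx - 1]   # the observation the sampled action was taken from
action = actions[idx]
reward = rewards[idx]
discount = discounts[idx]
next_obs = observations[idx]  # next_observation, recovered without a second array
```

This is where the +/- 1 offsets come from: the dummy entry shifts every stored step forward by one, so `idx - 1` and `idx` address the two halves of the same transition.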
But the main idea is that you can use `discount=0` to signal an episode end. Keep in mind that discount is an internal discount for the environment, rather than the agent's discount (i.e. \gamma = 0.99).
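For example, in a one-step TD target the two discounts play different roles (an illustrative sketch; `next_q` and the variable names are assumptions):

```python
gamma = 0.99  # agent discount \gamma, fixed by the algorithm

reward, discount = 1.0, 0.0  # e.g. the sampled rewards[idx], discounts[idx] at an episode end
next_q = 5.0                 # placeholder for a target network's value of next_obs

# discount is the environment's per-step discount: 1.0 on regular steps and
# 0.0 at an episode end, so bootstrapping is switched off exactly there.
target = reward + discount * gamma * next_q  # here: 1.0, no bootstrap term
```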