
Question regarding indices for replay buffer #8

Closed
snailrowen1337 opened this issue Oct 15, 2021 · 3 comments


@snailrowen1337

I am a little confused about how the indices work for the replay buffer. Specifically, a "dummy" transition is repeatedly referenced in replay_buffer.py, and there are some +/- 1 adjustments made to the indices:

idx = np.random.randint(0, episode_len(episode) - self._nstep + 1) + 1
obs = episode['observation'][idx - 1]
action = episode['action'][idx]
next_obs = episode['observation'][idx + self._nstep - 1]
reward = np.zeros_like(episode['reward'][idx])

From reading the code, it seems like the storage layout is

------------------+----------+------------+------------+----------
rewards           |   None   |  reward_0  |  reward_1  |  ........
------------------+----------+------------+------------+----------
observations      |   obs_0  |   obs_1    |   obs_2    |  ........
------------------+----------+------------+------------+----------
actions           |   None   |  action_0  |  action_1  |  ........
------------------+----------+------------+------------+----------

Is that the correct interpretation of the memory layout? And if so, why is the offset used?
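
For concreteness, here is a toy version of the layout I'm picturing (made-up shapes, not code from the repo):

import numpy as np

# Toy 3-step episode: index 0 of action/reward holds a dummy entry so that
# observation[i - 1], action[i], reward[i] all refer to transition i.
T = 3
episode = {
    'observation': np.random.randn(T + 1, 4).astype(np.float32),  # obs_0 .. obs_T
    'action': np.concatenate([np.zeros((1, 2), np.float32),       # dummy at index 0
                              np.random.randn(T, 2).astype(np.float32)]),
    'reward': np.concatenate([np.zeros(1, np.float32),            # dummy at index 0
                              np.random.randn(T).astype(np.float32)]),
}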

Thanks for open-sourcing this!!

@denisyarats
Contributor

Yes, this is the correct interpretation of the replay buffer layout.

The reason I did this is to avoid storing next_observation, which saves memory: we only store arrays for observations, actions, rewards, and discounts.

For example, a typical episode in DMC is 1000 steps, which, with action_repeat=2, equates to 500 steps. So if we execute an episode like this

time_step = env.reset() # this corresponds to (obs_0, None, None)
while not time_step.last():
  action = agent.act(time_step.observation)
  time_step = env.step(action) # this corresponds to (obs_i, action_i, reward_i)

it will produce 501 time steps that will be stored in the replay buffer. Now, to sample a transition from the replay buffer we first sample a random starting position from the [1, 500] interval by running

i = np.random.randint(0, episode_len(episode) - self._nstep + 1) + 1 
# episode_len(episode) -> 500, self._nstep -> 1

then we construct a transition like this

obs_{i-1}, action_i, reward_i, obs_i

If nstep > 1, we perform additional reward accumulation and take the corresponding obs_{i + nstep - 1}.
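
In code, the sampling roughly looks like this (a condensed sketch of the idea, not a verbatim copy of replay_buffer.py; episode['discount'] holds the environment's per-step discount):

import numpy as np

def episode_len(episode):
    # number of real transitions: array length minus the dummy first entry
    return episode['action'].shape[0] - 1

def sample_transition(episode, nstep, gamma):
    # gamma is the agent's discount; episode['discount'] is the per-step
    # environment discount (0.0 at a terminal state).
    idx = np.random.randint(0, episode_len(episode) - nstep + 1) + 1
    obs = episode['observation'][idx - 1]
    action = episode['action'][idx]
    next_obs = episode['observation'][idx + nstep - 1]
    reward = np.zeros_like(episode['reward'][idx])
    discount = np.ones_like(episode['discount'][idx])
    for i in range(nstep):
        reward += discount * episode['reward'][idx + i]  # accumulate the n-step return
        discount *= episode['discount'][idx + i] * gamma
    return obs, action, reward, discount, next_obs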

I hope this provides some intuition behind the design choice.

@snailrowen1337
Author

That makes perfect sense; it should save a lot of memory. Thanks!

It also seems like you're not storing whether the episode is over, i.e. the dones. Is that just for empirical performance?

@denisyarats
Contributor

Yes, dones are not stored, but we use discount to provide the same functionality. This is similar to what dm_env does; you can read more about it here: https://github.com/deepmind/dm_env/blob/master/docs/index.md.

But the main idea is that you can use discount=0 to signal an episode end. Keep in mind that discount is the environment's internal discounting, rather than the agent's discount (i.e. \gamma=0.99).
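
For intuition, a bootstrapped target can then use the stored discount exactly where (1 - done) would normally go (a minimal sketch; gamma is the agent's discount):

def td_target(reward, env_discount, next_value, gamma=0.99):
    # env_discount comes from the stored time step: 0.0 at a true terminal
    # state and 1.0 otherwise, so it zeroes the bootstrap term just like
    # the usual (1 - done) factor.
    return reward + gamma * env_discount * next_value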
