
Question regarding indices for replay buffer #8

Closed
snailrowen1337 opened this issue Oct 15, 2021 · 3 comments


@snailrowen1337

I am a little confused about how the indices work for the replay buffer. Specifically, a "dummy" transition is repeatedly referenced in replay_buffer.py, and there are some +/- 1 adjustments made to the indices:

idx = np.random.randint(0, episode_len(episode) - self._nstep + 1) + 1
obs = episode['observation'][idx - 1]
action = episode['action'][idx]
next_obs = episode['observation'][idx + self._nstep - 1]
reward = np.zeros_like(episode['reward'][idx])

From reading the code, it seems like the storage layout is

------------------+----------+------------+------------+----------
rewards           |   None   |  reward_0  |  reward_1  |  ........
------------------+----------+------------+------------+----------
observations      |   obs_0  |   obs_1    |   obs_2    |  ........
------------------+----------+------------+------------+----------
actions           |   None   |  action_0  |  action_1  |  ........
------------------+----------+------------+------------+----------

Is that the correct interpretation of the memory layout? And if so, why is the offset used?
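
For concreteness, here is a toy version of the layout I'm picturing (made-up shapes, not code from the repo):

import numpy as np

# Toy 3-step episode: index 0 of action/reward holds a dummy entry so that
# observation[i - 1], action[i], reward[i] all refer to transition i.
T = 3
episode = {
    'observation': np.random.randn(T + 1, 4).astype(np.float32),  # obs_0 .. obs_T
    'action': np.concatenate([np.zeros((1, 2), np.float32),       # dummy at index 0
                              np.random.randn(T, 2).astype(np.float32)]),
    'reward': np.concatenate([np.zeros(1, np.float32),            # dummy at index 0
                              np.random.randn(T).astype(np.float32)]),
}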

Thanks for open-sourcing this!!

@denisyarats
Contributor

Yes, this is the correct interpretation of the replay buffer layout.

The reason I did this is to avoid storing next_observation, which saves memory: we only store arrays for observations, actions, rewards, and discounts.

For example, a typical episode in DMC is 1000 steps, which, with action_repeat=2, equates to 500 steps. So if we execute an episode like this

time_step = env.reset() # this corresponds to (obs_0, None, None)
while not time_step.last():
  action = agent.act(time_step.observation)
  time_step = env.step(action) # this corresponds to (obs_i, action_i, reward_i)

it will produce 501 time steps that will be stored in the replay buffer. Now, to sample a transition from the replay buffer we first sample a random starting position from the [1, 500] interval by running

i = np.random.randint(0, episode_len(episode) - self._nstep + 1) + 1 
# episode_len(episode) -> 500, self._nstep -> 1

then we construct a transition like this

obs_{i-1}, action_i, reward_i, obs_i

If nstep > 1, we perform additional reward accumulation and take the corresponding obs_{i + nstep - 1}.
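
In code, the sampling roughly looks like this (a condensed sketch of the idea, not a verbatim copy of replay_buffer.py; episode['discount'] holds the environment's per-step discount):

import numpy as np

def episode_len(episode):
    # number of real transitions: array length minus the dummy first entry
    return episode['action'].shape[0] - 1

def sample_transition(episode, nstep, gamma):
    # gamma is the agent's discount; episode['discount'] is the per-step
    # environment discount (0.0 at a terminal state).
    idx = np.random.randint(0, episode_len(episode) - nstep + 1) + 1
    obs = episode['observation'][idx - 1]
    action = episode['action'][idx]
    next_obs = episode['observation'][idx + nstep - 1]
    reward = np.zeros_like(episode['reward'][idx])
    discount = np.ones_like(episode['discount'][idx])
    for i in range(nstep):
        reward += discount * episode['reward'][idx + i]  # accumulate the n-step return
        discount *= episode['discount'][idx + i] * gamma
    return obs, action, reward, discount, next_obs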

I hope this provides some intuition behind the design choice.

@snailrowen1337
Author

That makes perfect sense; it should save a lot of memory. Thanks!

It also seems like you're not storing whether the episode is over, i.e. the dones. Is that just for empirical performance?

@denisyarats
Contributor

Yes, dones are not stored, but we use discount to provide the same functionality. This is similar to what dm_env does; you can read more about it here: https://github.com/deepmind/dm_env/blob/master/docs/index.md.

But the main idea is that you can use discount=0 to signal an episode end. Keep in mind that discount is the environment's internal discounting, rather than the agent's discount (i.e. \gamma=0.99).
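
For intuition, a bootstrapped target can then use the stored discount exactly where (1 - done) would normally go (a minimal sketch; gamma is the agent's discount):

def td_target(reward, env_discount, next_value, gamma=0.99):
    # env_discount comes from the stored time step: 0.0 at a true terminal
    # state and 1.0 otherwise, so it zeroes the bootstrap term just like
    # the usual (1 - done) factor.
    return reward + gamma * env_discount * next_value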
