
[bug-fix] Fix POCA LSTM, pad sequences in the back #5206

Merged — 14 commits merged into main from develop-lstm-padding on Apr 5, 2021

Conversation

@ervteng (Contributor) commented Mar 31, 2021

Proposed change(s)

This PR does two things:

  • Fixes an issue with POCA's LSTM when sequence_length < time_horizon.

  • Moves padding of sequences shorter than sequence_length to the back. This is because we store the initial memories for each sequence (both policy and critic), but since we pad from the front, a short sequence is initialized with its stored memory and then run through zeros before reaching the real observations. For instance, given a trajectory [s0, s1, s2], we will store the initial memory m0. Then, when we update the model, we might have padded the trajectory to [pad0, pad1, s0, s1, s2], but will feed the memory m0 into the LSTM together with pad0.

This PR changes the padding to [s0, s1, s2, pad0, pad1], so that m0 goes with s0. This will also let us implement burn-in very easily later by masking the loss at the start of each sequence. A minimal sketch of the back-padding is below.
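The following is a standalone illustration of the back-padding idea, not the actual ML-Agents buffer code; the helper name and the use of NumPy here are assumptions made for the example.

    import numpy as np

    def pad_sequence_back(traj: np.ndarray, sequence_length: int) -> np.ndarray:
        # Pad a trajectory at the back so its length is a multiple of sequence_length.
        # Hypothetical helper for illustration only; ML-Agents does this inside its
        # AgentBuffer rather than with a standalone function.
        remainder = len(traj) % sequence_length
        if remainder == 0:
            return traj
        num_pad = sequence_length - remainder
        padding = np.zeros((num_pad,) + traj.shape[1:], dtype=traj.dtype)
        # Old behavior: np.concatenate([padding, traj]) put zeros before s0, so the
        # stored initial memory m0 was fed to the LSTM alongside pad0.
        # New behavior: zeros go after the real steps, so m0 lines up with s0.
        return np.concatenate([traj, padding])

    # [s0, s1, s2] with sequence_length = 5 -> [s0, s1, s2, 0, 0]
    print(pad_sequence_back(np.array([1.0, 2.0, 3.0]), 5))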

Pending: experiment results (currently running Hallway).

Useful links (GitHub issues, JIRA tickets, ML-Agents forum threads, etc.)

JIRA MLA-1873

Types of change(s)

  • Bug fix
  • New feature
  • Code refactor
  • Breaking change
  • Documentation update
  • Other (please describe)

Checklist

  • Added tests that prove my fix is effective or that my feature works
  • Updated the changelog (if applicable)
  • Updated the documentation (if applicable)
  • Updated the migration guide (if applicable)

Other comments

@ervteng requested a review from andrewcoh on March 31, 2021 18:21
# Evaluate other trajectories, carrying over _mem after each
# trajectory
for seq_num in range(
-    1, math.ceil((num_experiences) / (self.policy.sequence_length))
+    0, math.floor((num_experiences) / (self.policy.sequence_length))
Contributor:
Suggested change:
-    0, math.floor((num_experiences) / (self.policy.sequence_length))
+    math.floor((num_experiences) / (self.policy.sequence_length))

Contributor:
Suggested change:
-    0, math.floor((num_experiences) / (self.policy.sequence_length))
+    num_experiences // self.policy.sequence_length
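For reference, a quick illustrative check (made-up numbers standing in for num_experiences and self.policy.sequence_length) that the suggested integer-division form matches the floor expression, and how the full-sequence/leftover split falls out:

    import math

    num_experiences = 10
    sequence_length = 3  # stand-ins for the real values on the policy

    # Integer division counts the full sequences exactly like
    # math.floor(num_experiences / sequence_length), without the float round-trip.
    assert num_experiences // sequence_length == math.floor(num_experiences / sequence_length)

    num_full_sequences = num_experiences // sequence_length  # 3 full sequences
    leftover = num_experiences % sequence_length              # 1 leftover step

    starts = [seq_num * sequence_length for seq_num in range(num_full_sequences)]
    print(num_full_sequences, leftover, starts)  # 3 1 [0, 3, 6]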

# For the last sequence, the initial memory should be the one at the
# end of this trajectory.
for _ in range(last_seq_len):
    all_next_memories.append(ModelUtils.to_numpy(_mem.squeeze()))
Contributor:
Why does it matter what the memory in the padding is? Can we make this all zeros?

Contributor (Author):
This isn't in the padding, as the padding isn't added to the trajectory until later. This method only has unpadded data. Added a comment.

Contributor:
I don't understand that comment:

            # For the last sequence, the initial memory should be the one at the
            # end of this trajectory.

Can you give more info?

@@ -381,116 +382,110 @@ def _evaluate_by_sequence_team(
num_experiences = self_obs[0].shape[0]
all_next_value_mem = AgentBufferField()
all_next_baseline_mem = AgentBufferField()
-    # In the buffer, the 1st sequence are the ones that are padded. So if seq_len = 3 and
-    # trajectory is of length 10, the 1st sequence is [pad,pad,obs].
+    # In the buffer, the last sequence are the ones that are padded. So if seq_len = 3 and
Contributor:
Mention in this comment that this is about LSTM and memories.


# Evaluate other trajectories, carrying over _mem after each
# trajectory
for seq_num in range(
-    1, math.ceil((num_experiences) / (self.policy.sequence_length))
+    0, math.floor((num_experiences) / (self.policy.sequence_length))
Contributor:
This can be simplified now.

end = (seq_num + 1) * self.policy.sequence_length - (
    self.policy.sequence_length - leftover
)
start = seq_num * self.policy.sequence_length
Contributor:
This feels like there is a bunch of duplicate code with this part: https://github.com/Unity-Technologies/ml-agents/pull/5206/files#diff-01b42574a4de05de250e29039fc1bbc67b200f3a7c5ed0676b2eee549b943c11R95
Is it possible to combine some of it?

Contributor (Author):
It's annoyingly similar but just slightly different enough to make it really hard to combine. The POCA version has to operate on group obs and actions as well 🤔

Comment on lines +469 to +471
all_next_value_mem.append(ModelUtils.to_numpy(_value_mem.squeeze()))
all_next_baseline_mem.append(
    ModelUtils.to_numpy(_baseline_mem.squeeze())
Contributor:
Why are we padding with this value and not NaN or zeros?

Contributor (Author):

The sequence at this point isn't padded - it's just the leftover bit that would go into a new sequence. E.g., with a sequence length of 2, a trajectory of 5 steps splits into [[s0, s1], [s2, s3], [s4]]; at this point this is just s4. When we pad in the buffer, it will become [s4, 0], but not at this point.

With that said, only the 1st memory of each sequence is used. I'm repeating the memory rather than using zeros or ones so we don't need to allocate a new numpy array every time.
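To make this concrete, here is a small self-contained sketch with toy stand-ins. It is not the actual ML-Agents code (the real version runs the critic's LSTM and stores results in an AgentBufferField); it only shows that the leftover chunk's initial memory is the memory reached after the full sequences, and that it is repeated once per leftover step so the field has one entry per experience.

    import numpy as np

    sequence_length = 2
    trajectory = ["s0", "s1", "s2", "s3", "s4"]
    num_experiences = len(trajectory)
    leftover = num_experiences % sequence_length  # 1 -> leftover chunk is just s4

    def fake_lstm_step(obs, mem):
        # Placeholder for running one step through the LSTM and returning new memory.
        return mem + 1.0

    _mem = np.zeros(3)        # initial memory m0 stored with the trajectory
    all_next_memories = []

    # Full sequences [s0, s1] and [s2, s3]: record the memory going into each
    # sequence (once per step, so the field lines up with experiences), then
    # carry _mem forward through the sequence.
    for seq_num in range(num_experiences // sequence_length):
        start = seq_num * sequence_length
        seq_obs = trajectory[start : start + sequence_length]
        for _ in seq_obs:
            all_next_memories.append(_mem.copy())
        for obs in seq_obs:
            _mem = fake_lstm_step(obs, _mem)

    # Leftover chunk [s4]: its initial memory is whatever _mem is after the full
    # sequences, i.e. the memory at the end of the trajectory so far. Repeating it
    # per leftover step avoids allocating fresh zero arrays; only the first memory
    # of each sequence is ever read.
    for _ in range(leftover):
        all_next_memories.append(_mem.copy())

    assert len(all_next_memories) == num_experiences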

Comment on lines 109 to 110
last_seq_len = leftover
if last_seq_len > 0:
Contributor:

Suggested change:
-    last_seq_len = leftover
-    if last_seq_len > 0:
+    if leftover > 0:

last_seq_len = leftover
if last_seq_len > 0:
    for _obs in tensor_obs:
        last_seq_obs = _obs[-last_seq_len:]
Contributor:
nit: why use last_seq_len instead of leftover? Or rename leftover when it's first created.


# For the last sequence, the initial memory should be the one at the
# end of this trajectory.
for _ in range(last_seq_len):
@andrewcoh (Contributor), Mar 31, 2021:

Why do we add the last _mem for each extra leftover obs? Why isn't it enough to just append this once?

Contributor (Author):

We need the mem array to be the same length as all the other ones in the buffer. But you're right this isn't needed. To address @vincentpierre's comment above, we could also add _mem once and add a bunch of zeros/NaNs, but this saves us from having to create/allocate those empty arrays.

@ervteng (Contributor, Author) commented Apr 5, 2021:

Experimental update:

[Figure: Hallway training reward curves]
Hallway trains a little faster than on main (the green line is main; the other two are two runs of this PR).

[Figure: value estimate curves]
Value estimates are much cleaner, and value loss shows a similar difference. I suspect this is because of the lack of a random-length zero pad at the beginning of the sequence.

@@ -5,7 +5,7 @@ behaviors:
batch_size: 128
buffer_size: 1024
learning_rate: 0.0003
-    beta: 0.01
+    beta: 0.03
Contributor:
Should we add in the documentation that it may be good practice to increase beta with LSTM? Or do we think that this is a peculiarity of Hallway?

Contributor (Author):

I think it's because of Hallway's sparse reward and a reward structure that has a bunch of local maxima (at least 2, given the reward curve). Hard to say if we'd need to increase beta for other envs.
Our betas might just be too low across the board.

for signal_name in init_values.keys()
}

# When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
Contributor:
Suggested change:
-    # When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
+    # When using LSTM, we need to divide the trajectory into sequences of equal length. Sometimes,

signal_name: [init_values[signal_name]]
for signal_name in init_values.keys()
}
# When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
Contributor:
Suggested change:
-    # When using LSTM, we need to divide the trajectory into sequences of even length. Sometimes,
+    # When using LSTM, we need to divide the trajectory into sequences of equal length. Sometimes,

-    for seq_num in range(
-        1, math.ceil((num_experiences) / (self.policy.sequence_length))
-    ):
+    for seq_num in range(num_experiences // (self.policy.sequence_length)):
Contributor:
Suggested change:
-    for seq_num in range(num_experiences // (self.policy.sequence_length)):
+    for seq_num in range(num_experiences // self.policy.sequence_length):

@ervteng merged commit 2ce6810 into main on Apr 5, 2021
The delete-merged-branch bot deleted the develop-lstm-padding branch on April 5, 2021 22:42
ervteng pushed a commit that referenced this pull request Apr 8, 2021
* Pad buffer at the end

* Fix padding in optimizer value estimate

* Fix additional bugs and POCA

* Fix groupmate obs, add tests

* Update changelog

* Improve tests

* Address comments

* Fix poca test

* Fix buffer test

* Increase entropy for Hallway

* Add EOF newline

* Fix Behavior Name

* Address comments

(cherry picked from commit 2ce6810)
The github-actions bot locked this conversation as resolved and limited it to collaborators on Apr 6, 2022.