
Update reward signals in parallel with policy #2362

Merged: ervteng merged 34 commits into develop from develop-parallelrewardupdate on Aug 13, 2019

Conversation

ervteng (Contributor) commented Jul 29, 2019

Updates reward signals in parallel with the policy. This means that all batching must be handled by the Trainer rather than by the reward signal (losing some generality), but it produces a significant performance boost in training and requires a lot less code.

Note: I'm seeing about a 20-30% speedup using Curiosity (or GAIL + Curiosity) on CPU. I expect the speedup to be bigger on GPU; we'd want to test that before merging.

Slight change of behavior: the reported policy loss is no longer the absolute value of the policy loss.
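
As a rough illustration of the change (a minimal sketch only; `construct_feed_dict`, `prepare_update`, `update_dict`, and `reward_signals` are hypothetical names, not the actual ml-agents API), the trainer merges each reward signal's feed and fetch dictionaries into the policy update and runs everything in a single `session.run()` call, instead of one run per reward signal:

```python
# Illustrative sketch only -- the method and attribute names here are hypothetical,
# not the real ml-agents interfaces.
def update_policy_and_reward_signals(policy, mini_batch):
    """Run the policy update and all reward-signal updates in one session.run()."""
    feed_dict = policy.construct_feed_dict(mini_batch)  # hypothetical helper
    fetches = dict(policy.update_dict)  # losses and the policy train op

    for name, signal in policy.reward_signals.items():
        # The Trainer now owns batching: every signal sees the same mini_batch.
        feed_dict.update(signal.prepare_update(mini_batch))  # hypothetical helper
        fetches.update({f"{name}/{key}": op for key, op in signal.update_dict.items()})

    # One pass through the graph updates the policy and all reward signals together.
    return policy.sess.run(fetches, feed_dict=feed_dict)
```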

@CLAassistant commented Jul 29, 2019

CLA assistant check: all committers have signed the CLA.

@ervteng ervteng changed the base branch from master to develop July 29, 2019 23:04
@ervteng ervteng requested a review from harperj July 29, 2019 23:11
@ervteng ervteng marked this pull request as ready for review August 1, 2019 01:21
harperj (Contributor) left a comment


Code looks good to me. My only thought is about our discussion of this tying us to a certain number of epochs, ordering of updates, etc. for each of the reward signals, and removing that flexibility. On the other hand, maybe that's just an implementation detail best left up to the trainer. In any case, that's a bridge to cross when we come to it, I think. 🚢 🇮🇹

ervteng (Contributor, Author) commented Aug 5, 2019

> Code looks good to me. My only thought is about our discussion of this tying us to a certain number of epochs, ordering of updates, etc. for each of the reward signals, and removing that flexibility. On the other hand, maybe that's just an implementation detail best left up to the trainer. In any case, that's a bridge to cross when we come to it, I think. 🚢 🇮🇹

One thought I had about this: the user really shouldn't be touching these settings, IMO. For instance, setting num_epochs > 1 for GAIL in SAC breaks training. This change will make it easier for us to enforce "good defaults" across trainers, which at the end of the day might be better.
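
To make that concrete (a purely hypothetical sketch, not code from this PR), "good defaults" could mean the trainer owns the reward-signal update schedule rather than exposing it in the user config:

```python
# Hypothetical sketch of trainer-enforced reward-signal update schedules;
# the table and function below are illustrative, not part of ml-agents.
REWARD_SIGNAL_UPDATE_DEFAULTS = {
    "ppo": {"num_epochs": 3},  # assumes reward signals follow the policy's epoch count
    "sac": {"num_epochs": 1},  # num_epochs > 1 for GAIL breaks SAC training
}

def resolve_update_schedule(trainer_type: str) -> dict:
    """Return the trainer-enforced schedule rather than a user-configurable one."""
    return dict(REWARD_SIGNAL_UPDATE_DEFAULTS[trainer_type])
```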

@ervteng ervteng requested a review from harperj August 6, 2019 01:29
ervteng (Contributor, Author) commented Aug 6, 2019

Turns out this broke multi-GPU. We're working on a fix and will wait until it's done before pushing.

harperj (Contributor) left a comment


Changes look good to me; minor feedback on style. Did you test how this changes performance?

I also noticed the number of changes to multi-GPU. Did we verify this works correctly on a multi-GPU machine?

Review thread on ml-agents/mlagents/trainers/ppo/multi_gpu_policy.py (outdated, resolved)
@ervteng ervteng merged commit 34300b9 into develop Aug 13, 2019
@ervteng ervteng deleted the develop-parallelrewardupdate branch August 13, 2019 22:10
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 18, 2021