Update reward signals in parallel with policy #2362
Conversation
ml-agents/mlagents/trainers/components/reward_signals/curiosity/signal.py
Code looks good to me -- my only thought is from our discussion about this tying us to a fixed number of epochs, ordering of updates, etc. for each of the reward signals, which removes some flexibility. On the other hand, maybe that's just an implementation detail best left up to the trainer. In any case, that's a bridge to cross when we come to it, I think -- 🚢 🇮🇹
One thought I had about this: the user really shouldn't be touching these settings, IMO. For instance, setting num_epochs > 1 for GAIL in SAC breaks training. This change will make it easier for us to enforce "good defaults" across trainers, which at the end of the day might be better.
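To illustrate that point, here is a hedged sketch of what trainer-enforced defaults could look like. The table and helper below are hypothetical, not ML-Agents' actual config handling:

```python
# Hypothetical sketch of trainer-enforced reward signal defaults.
# With updates owned by the trainer, fragile settings can be pinned
# per trainer type rather than left to user config -- e.g. GAIL under
# SAC must use a single epoch, since num_epoch > 1 breaks training.

ENFORCED_SIGNAL_SETTINGS = {
    "sac": {"gail": {"num_epoch": 1}},  # grounded in the comment above
    "ppo": {"gail": {"num_epoch": 3}},  # purely illustrative value
}

def resolve_signal_settings(trainer_type, signal_name, user_settings):
    """Merge user settings with enforced defaults; enforced keys win."""
    enforced = ENFORCED_SIGNAL_SETTINGS.get(trainer_type, {}).get(signal_name, {})
    return {**user_settings, **enforced}
```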
Turns out this broke multi-GPU. We're working on a fix -- will wait until it's done before pushing.
Changes look good to me; minor feedback on style. Did you test how this change affects performance?
I also noticed the number of changes to multi-GPU. Did we verify this works correctly on a multi-GPU machine?
Updates reward signals in parallel with the policy. This means that all batching must be handled by the Trainer rather than the reward signal (losing some generality), but it produces a significant performance boost in training and requires a lot less code.
Note: I'm seeing about a 20-30% speedup using Curiosity (or GAIL + Curiosity) on CPU. I expect it to be bigger on GPU -- would want to test before merging.
Slight change of behavior: the reported policy loss is no longer the absolute value of the policy loss.
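For readers skimming the diff, a minimal sketch of the idea (hypothetical names, not the actual ML-Agents code): each reward signal exposes its TensorFlow update ops plus a way to build a feed dict, and the trainer merges those into the policy's update so everything runs in a single session call instead of one pass per signal.

```python
# Minimal sketch of the parallel-update idea. All names here
# (RewardSignal, prepare_update, construct_feed_dict) are hypothetical;
# the real PR restructures ML-Agents' own classes.

class RewardSignal:
    """A reward signal now only exposes its update ops and feed dict;
    batching and epoch looping are the trainer's responsibility."""

    def __init__(self):
        self.update_dict = {}  # name -> TF op/tensor to fetch

    def prepare_update(self, mini_batch):
        """Build the feed_dict for this mini-batch (placeholder -> data)."""
        raise NotImplementedError

def update_policy_and_signals(sess, policy, reward_signals, mini_batch):
    # Start from the policy's own fetches and placeholders.
    fetches = dict(policy.update_dict)
    feed_dict = policy.construct_feed_dict(mini_batch)
    # Fold every reward signal into the same run: one sess.run()
    # per mini-batch instead of one per signal, which is where the
    # reported 20-30% CPU speedup comes from.
    for signal in reward_signals:
        fetches.update(signal.update_dict)
        feed_dict.update(signal.prepare_update(mini_batch))
    return sess.run(fetches, feed_dict=feed_dict)
```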