
Update reward signals in parallel with policy #2362

Merged: ervteng merged 34 commits into develop from develop-parallelrewardupdate on Aug 13, 2019

Conversation

ervteng (Contributor) commented Jul 29, 2019

Updates reward signals in parallel with the policy. This means that all batching must be handled by the Trainer rather than by the reward signal (losing some generality), but it produces a significant performance boost in training and requires a lot less code.

Note: I'm seeing about a 20-30% speedup using Curiosity (or GAIL + Curiosity) on CPU. I expect the speedup to be bigger on GPU; we'd want to test that before merging.

Slight change of behavior: the reported policy loss is no longer the absolute value of the policy loss.
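
As a rough illustration of the change (a minimal sketch only; `construct_feed_dict`, `prepare_update`, `update_dict`, and `reward_signals` are hypothetical names, not the actual ml-agents API), the trainer merges each reward signal's feed and fetch dictionaries into the policy update and runs everything in a single `session.run()` call, instead of one run per reward signal:

```python
# Illustrative sketch only -- the method and attribute names here are hypothetical,
# not the real ml-agents interfaces.
def update_policy_and_reward_signals(policy, mini_batch):
    """Run the policy update and all reward-signal updates in one session.run()."""
    feed_dict = policy.construct_feed_dict(mini_batch)  # hypothetical helper
    fetches = dict(policy.update_dict)  # losses and the policy train op

    for name, signal in policy.reward_signals.items():
        # The Trainer now owns batching: every signal sees the same mini_batch.
        feed_dict.update(signal.prepare_update(mini_batch))  # hypothetical helper
        fetches.update({f"{name}/{key}": op for key, op in signal.update_dict.items()})

    # One pass through the graph updates the policy and all reward signals together.
    return policy.sess.run(fetches, feed_dict=feed_dict)
```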

@CLAassistant commented Jul 29, 2019

CLA assistant check: all committers have signed the CLA.

@ervteng ervteng changed the base branch from master to develop July 29, 2019 23:04
@ervteng ervteng requested a review from harperj July 29, 2019 23:11
@ervteng ervteng marked this pull request as ready for review August 1, 2019 01:21
harperj (Contributor) left a comment


Code looks good to me. My only thought is about our discussion of this tying us to a certain number of epochs, ordering of updates, etc. for each of the reward signals, and removing that flexibility. On the other hand, maybe that's just an implementation detail best left up to the trainer. In any case, that's a bridge to cross when we come to it, I think. 🚢 🇮🇹

ervteng (Contributor, Author) commented Aug 5, 2019

> Code looks good to me. My only thought is about our discussion of this tying us to a certain number of epochs, ordering of updates, etc. for each of the reward signals, and removing that flexibility. On the other hand, maybe that's just an implementation detail best left up to the trainer. In any case, that's a bridge to cross when we come to it, I think. 🚢 🇮🇹

One thought I had about this: the user really shouldn't be touching these settings, IMO. For instance, setting num_epochs > 1 for GAIL in SAC breaks training. This change will make it easier for us to enforce "good defaults" across trainers, which at the end of the day might be better.
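
To make that concrete (a purely hypothetical sketch, not code from this PR), "good defaults" could mean the trainer owns the reward-signal update schedule rather than exposing it in the user config:

```python
# Hypothetical sketch of trainer-enforced reward-signal update schedules;
# the table and function below are illustrative, not part of ml-agents.
REWARD_SIGNAL_UPDATE_DEFAULTS = {
    "ppo": {"num_epochs": 3},  # assumes reward signals follow the policy's epoch count
    "sac": {"num_epochs": 1},  # num_epochs > 1 for GAIL breaks SAC training
}

def resolve_update_schedule(trainer_type: str) -> dict:
    """Return the trainer-enforced schedule rather than a user-configurable one."""
    return dict(REWARD_SIGNAL_UPDATE_DEFAULTS[trainer_type])
```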

@ervteng ervteng requested a review from harperj August 6, 2019 01:29
ervteng (Contributor, Author) commented Aug 6, 2019

Turns out this broke multi-GPU. We're working on a fix and will wait until it's done before pushing.

harperj (Contributor) left a comment


Changes look good to me; minor feedback on style. Did you test how this changes performance?

I also noticed the number of changes to multi-GPU. Did we verify this works correctly on a multi-GPU machine?

Review thread on ml-agents/mlagents/trainers/ppo/multi_gpu_policy.py (outdated, resolved)
@ervteng ervteng merged commit 34300b9 into develop Aug 13, 2019
@ervteng ervteng deleted the develop-parallelrewardupdate branch August 13, 2019 22:10
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 18, 2021