Refactor reward signals into separate class #2144

Merged: 144 commits, Jul 3, 2019
Changes shown are from 127 commits.

Commits
eb4abf2
New version of GAIL
awjuliani Oct 9, 2018
d0852ac
Move Curiosity to separate class
awjuliani Oct 12, 2018
4b15b80
Curiosity fully working under new system
awjuliani Oct 12, 2018
ad9381b
Begin implementing GAIL
awjuliani Oct 12, 2018
8bf8302
fix discrete curiosity
vincentpierre Oct 12, 2018
d3e244e
Add expert demonstration
awjuliani Oct 13, 2018
a5b95f7
Remove notebook
awjuliani Oct 13, 2018
dc2fcaa
Record intrinsic rewards properly
awjuliani Oct 13, 2018
49cff40
Add gail model updating
awjuliani Oct 13, 2018
48d3769
Code cleanup
awjuliani Oct 15, 2018
6eeb565
Nested structure for intrinsic rewards
awjuliani Oct 15, 2018
8ca7728
Rename files
awjuliani Oct 15, 2018
226b5c7
Update models so files
awjuliani Oct 15, 2018
3386aa7
fix typo
awjuliani Oct 15, 2018
6799756
Add reward strength parameter
awjuliani Oct 15, 2018
468c407
Use dictionary of reward signals
awjuliani Oct 17, 2018
519e2d3
Remove reward manager
awjuliani Oct 17, 2018
7df1a69
Extrinsic reward just another type
awjuliani Oct 17, 2018
99237cd
Clean up imports
awjuliani Oct 17, 2018
9fa51c1
All reward signals use strength to scale output
awjuliani Oct 17, 2018
7f24677
produce scaled and unscaled reward
awjuliani Oct 18, 2018
4a714d0
Remove unused dictionary
awjuliani Oct 18, 2018
3e2671d
Current trainer config
awjuliani Oct 18, 2018
77211d8
Add discrete control and pyramid experimentation
awjuliani Oct 19, 2018
2334de8
Minor changes to GAIL
awjuliani Oct 20, 2018
439387e
Add relevant strength parameters
awjuliani Oct 21, 2018
ba793a3
Replace string
awjuliani Oct 21, 2018
a52ba0b
Add support for visual observations w/ GAIL
awjuliani Oct 31, 2018
5b2ef22
Finish implementing visual obs for GAIL
awjuliani Nov 1, 2018
13542b4
Include demo files
awjuliani Nov 1, 2018
ae7a8b0
Fix for RNN w/ GAIL
awjuliani Nov 1, 2018
bf89082
Keep track of reward streams separately
awjuliani Nov 2, 2018
360482b
Bootstrap value estimates separately
awjuliani Nov 2, 2018
c78639d
Add value head
awjuliani Nov 14, 2018
3b2485d
Use separate value streams for each reward
awjuliani Nov 15, 2018
40bc9ba
Add VAIL
awjuliani Nov 15, 2018
c6e1504
Use adaptive B
awjuliani Nov 16, 2018
60d9ff7
Comments improvements
vincentpierre Jan 10, 2019
49ec682
Added comments and refactored a piece of the code
vincentpierre Jan 10, 2019
d9847e0
Added Comments
vincentpierre Jan 10, 2019
dc7620b
Fix on Curiosity
vincentpierre Jan 11, 2019
28e0bd5
Fixed typo
vincentpierre Jan 11, 2019
0257d2b
Added a forgotten comment
vincentpierre Jan 11, 2019
fd55c00
Stabilized Vail learning. Still no learning for Walker
vincentpierre Jan 14, 2019
2343b3f
Fixing typo on curiosity when using visual input
vincentpierre Jan 17, 2019
c74ad19
Added some comments
vincentpierre Jan 17, 2019
2dd7c61
modified the hyperparameters
vincentpierre Jan 17, 2019
42429a5
Fixed some of the tests, will need to refactor the reward signals in …
vincentpierre Jan 19, 2019
ec0e106
Putting the has_updated flags inside each reward signal
vincentpierre Jan 22, 2019
6ae1c2f
Added comments for the GAIL update method
vincentpierre Jan 22, 2019
ef65bc2
initial commit
vincentpierre Jan 24, 2019
8cbdbf4
No more normalization after pre-training
vincentpierre Jan 24, 2019
3f35d45
Fixed large bug in Vail
vincentpierre Jan 30, 2019
3be9be7
BUG FIX VAIL : The noise dimension was wrong and the discriminator sc…
vincentpierre Feb 1, 2019
9e9b4ff
implemented discrete control pretraining
vincentpierre Feb 2, 2019
d537a6b
bug fixing
vincentpierre Feb 3, 2019
713263c
Bug fix, still not tested for recurrent
vincentpierre Feb 6, 2019
ca5b948
Fixing beta in GAIL so it will change properly
vincentpierre Mar 6, 2019
671629e
Allow for not specifying an extrinsic reward
Apr 19, 2019
a31c8a5
Rough implementation of annealed BC
Apr 24, 2019
93cb4ff
Fixes for rebase onto v0.8
Apr 24, 2019
6534291
Moved BC trainer out of reward_signals and code cleanup
Apr 25, 2019
700b478
Rename folder to "components"
Apr 25, 2019
71eedf5
Fix renaming in Curiosity
Apr 25, 2019
83b4603
Remove demo_aided as a required param
May 2, 2019
9e4b4e2
Make old BC compatible
May 2, 2019
f814432
Fix visual obs for curiosity
May 3, 2019
e10194f
Tweaks all around
May 9, 2019
fdcfb30
Add reward normalization and bug fix
May 9, 2019
cb5e927
Load multiple .demo files. Fix bug with csv nans
May 30, 2019
2c5c853
Remove reward normalization
May 30, 2019
e66a343
Rename demo_aided to pretraining
May 30, 2019
0a98289
Fix bc configs
May 30, 2019
cd6e498
Increase small val to prevent NaNs
May 30, 2019
d23f6f3
Fix init in components
May 31, 2019
d93e36e
Merge remote-tracking branch 'origin/develop' into develop-irl-ervin
May 31, 2019
1bf68c7
Fix PPO tests
May 31, 2019
9da6e6c
Refactor components into common location
May 31, 2019
4a57a32
Minor code cleanup
Jun 3, 2019
11cc6f9
Preliminary RNN support
Jun 5, 2019
e66a6f7
Revert regression with NaNs for LSTMs
Jun 6, 2019
bea2bc7
Better LSTM support for BC
Jun 6, 2019
6302a55
Code cleanup and black reformat
Jun 6, 2019
d1cded9
Remove demo_helper and reformat signal
Jun 6, 2019
2b98f3b
Tests for GAIL and curiosity
Jun 6, 2019
440146b
Fix Black again...
Jun 6, 2019
98f9160
Tests for BCModule and visual tests for RewardSignals
Jun 6, 2019
5c923cb
Refactor to new structure and use class generator
Jun 7, 2019
e7ce888
Generalize reward_signal interface and stats
Jun 8, 2019
858194f
Fix incorrect environment reward reporting
Jun 10, 2019
28bceba
Rename reward signals for consistency. clean up comments
Jun 10, 2019
248cae4
Default trainer config (for cloud testing)
Jun 10, 2019
744df94
Remove "curiosity_enc_size" from the regular params
Jun 10, 2019
31dabfc
Fix PushBlock config
Jun 10, 2019
a557f84
Revert Pyramids environment
Jun 10, 2019
d4dbddb
Fix indexing issue with add_experiences
Jun 11, 2019
ddb673b
Fix tests
Jun 11, 2019
975e05b
Change to BCModule
Jun 11, 2019
a83fd5d
Merge branch 'develop' into develop-irl-ervin
Jun 12, 2019
fae7646
Remove the bools for reward signals
Jun 12, 2019
5cf98ac
Make update take in a mini buffer rather than the
Jun 13, 2019
d1afc9b
Always reference reward signals name and not index
Jun 13, 2019
80f2c75
More code cleanup
Jun 13, 2019
394b25a
Clean up reward_signal abstract class
Jun 13, 2019
a9724a3
Fix issue with recording values
Jun 13, 2019
66fef61
Add use_actions to GAIL
Jun 17, 2019
0e3be1d
Add documentation for Reward Signals
Jun 17, 2019
015f50d
Add documentation for GAIL
Jun 17, 2019
7c3059b
Remove unused variables in BCModel
Jun 17, 2019
16c3c06
Remove Entropy Reward Signal
Jun 17, 2019
1fbfa5d
Change tests to use safe_load
Jun 17, 2019
f9a3808
Don't use mutable default
Jun 17, 2019
ce551bf
Set defaults in parent __init__ (Reward Signals)
Jun 17, 2019
3e7ea5b
Remove unnecessary lines
Jun 17, 2019
a40d8be
Remove new features
Jun 17, 2019
abc66cc
Add learning rate option to Curiosity
Jun 17, 2019
1aa0fc5
Correct docs for Reward Signals
Jun 17, 2019
3bccf7f
Revert trainer configs to develop ver
Jun 17, 2019
aab7165
Clean up BC files
Jun 17, 2019
bbbb2e9
Revert BC model
Jun 17, 2019
b5ca952
Revert some changes to trainer
Jun 17, 2019
31cf875
Some more trainer_config cleanup
Jun 17, 2019
53a472d
Make new trainer compatible with old BC
Jun 17, 2019
29e93f2
Merge branch 'develop' into develop-rewardsignalsrefactor
Jun 17, 2019
133a258
Fix black formats
Jun 17, 2019
bae045d
Fixes to typos and unnecessary enumerate()
Jun 18, 2019
b03de8f
Use NamedTuple and more code cleanup
Jun 18, 2019
3499d60
Recursive printing of hyperparams
Jun 18, 2019
411db72
Black format
Jun 18, 2019
84478bf
Doc fixes
Jun 19, 2019
a5f148c
Fixed comment for evaluate
Jun 19, 2019
3733a68
More doc tweaks
Jun 19, 2019
32815f5
Make PPO prints more generic
Jun 19, 2019
70f7407
fix crawler dynamic hyperparams
Jun 20, 2019
0d02b24
Clean up doc formatting
Jun 20, 2019
bba9d7d
Change setup.py so all packages are installed
Jun 20, 2019
de2b5d5
Tweak pyramids hyperparams
Jun 21, 2019
adc9915
More tweaks to Pyramids
Jun 21, 2019
6a1d8d1
curiosity doc section
Jul 1, 2019
554b1c2
Merge remote-tracking branch 'origin/develop' into develop-rewardsign…
Jul 1, 2019
52d3974
get mypy passing
Jul 1, 2019
471b489
Tweak Pyramids hyperparameters
Jul 1, 2019
ed5e84e
Merge branch 'develop-rewardsignalsrefactor' of github.com:Unity-Tech…
Jul 1, 2019
87d77b4
Call static function rather than class function
Jul 3, 2019
54 changes: 36 additions & 18 deletions config/trainer_config.yaml
@@ -4,7 +4,6 @@ default:
beta: 5.0e-3
buffer_size: 10240
epsilon: 0.2
gamma: 0.99
hidden_units: 128
lambd: 0.95
learning_rate: 3.0e-4
@@ -17,14 +16,15 @@ default:
sequence_length: 64
summary_freq: 1000
use_recurrent: false
use_curiosity: false
curiosity_strength: 0.01
curiosity_enc_size: 128
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

BananaLearning:
normalize: false
batch_size: 1024
beta: 5.0e-3
batch_size: 1024
buffer_size: 10240
max_steps: 1.0e5

@@ -93,9 +93,7 @@ GoalieLearning:
normalize: false

PyramidsLearning:
use_curiosity: true
summary_freq: 2000
curiosity_strength: 0.01
curiosity_enc_size: 256
time_horizon: 128
batch_size: 128
@@ -105,11 +103,18 @@ PyramidsLearning:
beta: 1.0e-2
max_steps: 5.0e5
num_epoch: 3

sequence_length: 16
use_recurrent: false
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
curiosity:
strength: 0.01
gamma: 0.99
encoding_size: 128

VisualPyramidsLearning:
use_curiosity: true
curiosity_strength: 0.01
curiosity_enc_size: 256
time_horizon: 128
batch_size: 64
buffer_size: 2024
@@ -118,6 +123,14 @@ VisualPyramidsLearning:
beta: 1.0e-2
max_steps: 5.0e5
num_epoch: 3
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
curiosity:
strength: 0.01
gamma: 0.99
encoding_size: 256

3DBallLearning:
normalize: true
@@ -126,9 +139,7 @@ VisualPyramidsLearning:
summary_freq: 1000
time_horizon: 1000
lambd: 0.99
gamma: 0.995
beta: 0.001
use_curiosity: true

3DBallHardLearning:
normalize: true
@@ -137,8 +148,11 @@ VisualPyramidsLearning:
summary_freq: 1000
time_horizon: 1000
max_steps: 5.0e5
gamma: 0.995
beta: 0.001
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

TennisLearning:
normalize: true
@@ -150,19 +164,21 @@ CrawlerStaticLearning:
time_horizon: 1000
batch_size: 2024
buffer_size: 20240
gamma: 0.995
max_steps: 1e6
summary_freq: 3000
num_layers: 3
hidden_units: 512
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

CrawlerDynamicLearning:
normalize: true
num_epoch: 3
time_horizon: 1000
batch_size: 2024
buffer_size: 20240
gamma: 0.995
max_steps: 1e6
summary_freq: 3000
num_layers: 3
@@ -174,11 +190,14 @@ WalkerLearning:
time_horizon: 1000
batch_size: 2048
buffer_size: 20480
gamma: 0.995

Review comment (Contributor):
If gamma is removed from here, does this mean that old versions of the config will no longer be compatible? Is this going to break people's stuff (I am okay with it) or is there a fallback?

Reply (Contributor, Author):
Yeah, the old versions of the config aren't compatible. Having gamma won't break anything, but it will end up using the default gamma from default_config. We could auto-assign the gamma here to the extrinsic gamma, but that would break the abstraction. I guess we'll just have to be careful in the migration guide.

max_steps: 2e6
summary_freq: 3000
num_layers: 3
hidden_units: 512
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

ReacherLearning:
normalize: true
@@ -197,7 +216,6 @@ HallwayLearning:
hidden_units: 128
memory_size: 256
beta: 1.0e-2
gamma: 0.99
num_epoch: 3
buffer_size: 1024
batch_size: 128
45 changes: 14 additions & 31 deletions docs/Training-PPO.md
@@ -7,6 +7,10 @@ observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).

To train an agent, you will need to provide the agent one or more reward signals which
the agent should attempt to maximize. See [Reward Signals](Training-RewardSignals.md)
Review comment (Contributor):
"the agent should attempt to maximize" - I know RL does not work, but try to act like it does.

Reply (Contributor, Author):
Will replace with "will learn to maximize".

for the available reward signals and the corresponding hyperparameters.

See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the
training program, `learn.py`.

@@ -31,15 +35,18 @@ of performance you would like.

## Hyperparameters

### Gamma
### Reward Signals

`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.
In reinforcement learning, the goal is to learn a Policy that maximizes reward.
At a base level, the reward is given by the environment. However, we could imagine
rewarding the agent for various different behaviors. For instance, we could reward
the agent for exploring new states, rather than just when an explicit reward is given.
Furthermore, we could mix reward signals to help the learning process.

Typical Range: `0.8` - `0.995`
`reward_signals` provides a section to define [reward signals](Training-RewardSignals.md).
ML-Agents provides two reward signals by default, the Extrinsic (environment) reward, and the
Curiosity reward, which can be used to encourage exploration in sparse extrinsic reward
environments.

### Lambda

@@ -184,30 +191,6 @@ the agent will need to remember in order to successfully complete the task.

Typical Range: `64` - `512`

## (Optional) Intrinsic Curiosity Module Hyperparameters

The below hyperparameters are only used when `use_curiosity` is set to true.

### Curiosity Encoding Size

`curiosity_enc_size` corresponds to the size of the hidden layer used to encode
the observations within the intrinsic curiosity module. This value should be
small enough to encourage the curiosity module to compress the original
observation, but also not too small to prevent it from learning the dynamics of
the environment.

Typical Range: `64` - `256`

### Curiosity Strength

`curiosity_strength` corresponds to the magnitude of the intrinsic reward
generated by the intrinsic curiosity module. This should be scaled in order to
ensure it is large enough to not be overwhelmed by extrinsic reward signals in
the environment. Likewise it should not be too large to overwhelm the extrinsic
reward signal.

Typical Range: `0.1` - `0.001`

## Training Statistics

To view training statistics, use TensorBoard. For information on launching and
100 changes: 100 additions & 0 deletions docs/Training-RewardSignals.md
@@ -0,0 +1,100 @@
# Reward Signals

In reinforcement learning, the end goal for the Agent is to discover a behavior (a Policy)
that maximizes a reward. Typically, a reward is defined by your environment, and corresponds
to reaching some goal. These are what we refer to as "extrinsic" rewards, as they are defined
externally to the learning algorithm.

Rewards, however, can be defined outside of the environment as well, to encourage the agent to
behave in certain ways, or to aid the learning of the true extrinsic reward. We refer to these
rewards as "intrinsic" reward signals. The total reward that the agent attempts to maximize can
be a mix of extrinsic and intrinsic reward signals.
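
As a rough sketch of what "a mix" means here (each signal in this PR also carries its own `gamma` and value stream, so this is conceptual rather than exact), the per-step reward being maximized is a strength-weighted sum over the enabled signals:

```latex
r_t = \sum_{i \in \text{signals}} s_i \, r_t^{(i)}
```

where \(r_t^{(i)}\) is the raw reward produced by signal \(i\) at step \(t\) and \(s_i\) is its `strength` setting.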

ML-Agents allows reward signals to be defined in a modular way, and we provide three reward
signals that can be mixed and matched to help shape your agent's behavior. The `extrinsic` Reward Signal represents the rewards defined in your environment, and is enabled by default.
The `curiosity` reward signal helps your agent explore when extrinsic rewards are sparse.

## Enabling Reward Signals

Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. An
example is provided in `config/trainer_config.yaml`. To enable a reward signal, add it to the
`reward_signals:` section under the brain name. For instance, to enable the extrinsic signal
in addition to a small curiosity reward, you would define your `reward_signals` as follows:

```
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
curiosity:
strength: 0.01
gamma: 0.99
encoding_size: 128
```

Each reward signal should define at least two parameters, `strength` and `gamma`, in addition
to any class-specific hyperparameters. Note that to remove a reward signal, you should delete
its entry entirely from `reward_signals`. At least one reward signal should be left defined
at all times.
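
As a minimal sketch of how a trainer might consume this section (the class and function names below, other than `CuriosityRewardSignal` which appears elsewhere in this diff, are illustrative assumptions rather than the PR's actual API):

```python
from typing import Any, Dict


class RewardSignalStub:
    """Stand-in for the reward-signal base class this PR introduces."""

    def __init__(self, strength: float, gamma: float, **kwargs: Any) -> None:
        self.strength = strength      # scales the raw reward
        self.gamma = gamma            # discount factor for this signal's value stream
        self.extra_params = kwargs    # e.g. encoding_size, learning_rate

    def evaluate(self, raw_rewards):
        """Return the strength-scaled rewards for a batch of steps."""
        return [self.strength * r for r in raw_rewards]


def create_reward_signals(config: Dict[str, Dict[str, Any]]) -> Dict[str, RewardSignalStub]:
    """Build one reward-signal object per entry under `reward_signals`, keyed by name."""
    return {name: RewardSignalStub(**params) for name, params in config.items()}


if __name__ == "__main__":
    reward_signal_config = {
        "extrinsic": {"strength": 1.0, "gamma": 0.99},
        "curiosity": {"strength": 0.01, "gamma": 0.99, "encoding_size": 128},
    }
    signals = create_reward_signals(reward_signal_config)
    print(sorted(signals))                       # ['curiosity', 'extrinsic']
    print(signals["curiosity"].evaluate([1.0]))  # [0.01]
```

Keying the signals by name rather than index matches the direction of the later commits in this PR ("Always reference reward signals name and not index").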

## Reward Signal Types

### The Extrinsic Reward Signal

The `extrinsic` reward signal is simply the reward given by the
[environment](Learning-Environment-Design.md). Remove it to force the agent
to ignore the environment reward.

#### Strength

`strength` is the factor by which to multiply the raw
reward. Typical ranges will vary depending on the reward signal.

Typical Range: `0.01` - `1.0`

#### Gamma

`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.

Typical Range: `0.8` - `0.995`
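
For reference, `gamma` enters through the standard discounted return that each signal's value estimate targets:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k}
```

A `gamma` of 0.99 weights a reward 100 steps away by roughly \(0.99^{100} \approx 0.37\), while a `gamma` of 0.8 shrinks the same reward to about \(2 \times 10^{-10}\), which is why long-horizon tasks need values near the top of the range.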

### The Curiosity Reward Signal

@chriselion
Review comment (Contributor):
What is this line for?

Reply (Contributor):
I'm supposed to write it at some point. @ervteng do you want to leave this empty for now and I'll do it in another PR?

Reply (Contributor, Author):
That works - let me add a one-liner for this PR so it isn't completely empty in this PR.

Reply (Contributor):
I would like this removed / addressed before merge.


#### Strength

In this case, `strength` corresponds to the magnitude of the curiosity reward generated
by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough
to not be overwhelmed by extrinsic reward signals in the environment.
Likewise it should not be too large to overwhelm the extrinsic reward signal.

Typical Range: `0.1 - `0.001`
Review comment (Contributor):
The ` symbol is misplaced. Also, the range needs to go from smaller to larger.


#### Gamma

`gamma` corresponds to the discount factor for future rewards.

Typical Range: `0.8` - `0.9`
Review comment (Contributor):
Is the gamma really this small?

Reply (Contributor, Author):
This was from GAIL - I think it should be 0.995 actually.


#### Encoding Size

`encoding_size` corresponds to the size of the encoding used by the intrinsic curiosity model.
This value should be small enough to encourage the ICM to compress the original
observation, but also not too small to prevent it from learning the dynamics of
the environment.

Default Value: 64
Typical Range: `64` - `256`

#### Learning Rate

`learning_rate` is the learning rate used to update the intrinsic curiosity module.
This should typically be decreased if training is unstable and the curiosity loss is also unstable.

Default Value: `3e-4`
Typical Range: `1e-5` - `1e-3`
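
To make the roles of `encoding_size`, `strength`, and the forward-model prediction error concrete, here is a minimal NumPy sketch of the idea behind the curiosity reward (the intrinsic curiosity module of Pathak et al.). It is not the TensorFlow implementation in this PR; the random linear maps stand in for the learned encoder and forward model:

```python
import numpy as np

rng = np.random.default_rng(0)
encoding_size = 128
obs_size, action_size = 32, 4

# Stand-ins for the learned encoder and forward model.
W_enc = rng.normal(scale=0.1, size=(obs_size, encoding_size))
W_fwd = rng.normal(scale=0.1, size=(encoding_size + action_size, encoding_size))


def encode(obs: np.ndarray) -> np.ndarray:
    """phi(s): compress the observation into `encoding_size` features."""
    return np.tanh(obs @ W_enc)


def curiosity_reward(obs, action, next_obs, strength=0.01) -> float:
    """Intrinsic reward = strength-scaled prediction error of the forward model."""
    phi, phi_next = encode(obs), encode(next_obs)
    predicted_next = np.tanh(np.concatenate([phi, action]) @ W_fwd)
    return strength * 0.5 * float(np.sum((predicted_next - phi_next) ** 2))


obs, next_obs = rng.normal(size=obs_size), rng.normal(size=obs_size)
action = np.eye(action_size)[1]  # one-hot discrete action
print(curiosity_reward(obs, action, next_obs))
```

States the agent has not visited produce larger prediction errors, and therefore larger intrinsic rewards, which is what drives exploration when the extrinsic reward is sparse.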
2 changes: 1 addition & 1 deletion ml-agents/mlagents/trainers/bc/policy.py
@@ -57,7 +57,7 @@ def evaluate(self, brain_info):
self.model.sequence_length: 1,
}

feed_dict = self._fill_eval_dict(feed_dict, brain_info)
feed_dict = self.fill_eval_dict(feed_dict, brain_info)
if self.use_recurrent:
if brain_info.memories.shape[1] == 0:
brain_info.memories = self.make_empty_memory(len(brain_info.agents))
Empty file.
@@ -0,0 +1 @@
from .reward_signal import *
@@ -0,0 +1 @@
from .signal import CuriosityRewardSignal