Refactor reward signals into separate class #2144
Changes from 127 commits
@@ -7,6 +7,10 @@ observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).

To train an agent, you will need to provide the agent one or more reward signals which
the agent will learn to maximize. See [Reward Signals](Training-RewardSignals.md)
for the available reward signals and the corresponding hyperparameters.

See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the
training program, `learn.py`.

@@ -31,15 +35,18 @@ of performance you would like.

## Hyperparameters

### Gamma
### Reward Signals

`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.

In reinforcement learning, the goal is to learn a Policy that maximizes reward.
At a base level, the reward is given by the environment. However, we could imagine
rewarding the agent for various different behaviors. For instance, we could reward
the agent for exploring new states, rather than just when an explicit reward is given.
Furthermore, we could mix reward signals to help the learning process.

Typical Range: `0.8` - `0.995`

`reward_signals` provides a section to define [reward signals](Training-RewardSignals.md).
ML-Agents provides two reward signals by default, the Extrinsic (environment) reward, and the
Curiosity reward, which can be used to encourage exploration in sparse extrinsic reward
environments.

### Lambda

@@ -184,30 +191,6 @@ the agent will need to remember in order to successfully complete the task.

Typical Range: `64` - `512`

## (Optional) Intrinsic Curiosity Module Hyperparameters

The below hyperparameters are only used when `use_curiosity` is set to true.

### Curiosity Encoding Size

`curiosity_enc_size` corresponds to the size of the hidden layer used to encode
the observations within the intrinsic curiosity module. This value should be
small enough to encourage the curiosity module to compress the original
observation, but also not too small to prevent it from learning the dynamics of
the environment.

Typical Range: `64` - `256`

### Curiosity Strength

`curiosity_strength` corresponds to the magnitude of the intrinsic reward
generated by the intrinsic curiosity module. This should be scaled in order to
ensure it is large enough to not be overwhelmed by extrinsic reward signals in
the environment. Likewise it should not be too large to overwhelm the extrinsic
reward signal.

Typical Range: `0.1` - `0.001`

## Training Statistics

To view training statistics, use TensorBoard. For information on launching and
@@ -0,0 +1,100 @@
# Reward Signals

In reinforcement learning, the end goal for the Agent is to discover a behavior (a Policy)
that maximizes a reward. Typically, a reward is defined by your environment, and corresponds
to reaching some goal. These are what we refer to as "extrinsic" rewards, as they are defined
external to the learning algorithm.

Rewards, however, can be defined outside of the environment as well, to encourage the agent to
behave in certain ways, or to aid the learning of the true extrinsic reward. We refer to these
rewards as "intrinsic" reward signals. The total reward that the agent attempts to maximize can
be a mix of extrinsic and intrinsic reward signals.

ML-Agents allows reward signals to be defined in a modular way, and we provide two reward
signals that can be mixed and matched to help shape your agent's behavior. The `extrinsic`
Reward Signal represents the rewards defined in your environment, and is enabled by default.

The `curiosity` reward signal helps your agent explore when extrinsic rewards are sparse.
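As a small illustrative sketch (not library code; the per-step reward values are made up), mixing signals simply means scaling each signal's per-step reward by its `strength` and summing the results:

```python
# Illustrative sketch only: combining an extrinsic and an intrinsic (curiosity) reward
# for a single step. The strength values mirror the config example later on this page;
# the per-step rewards themselves are made-up numbers.
extrinsic_reward = 1.0    # reward returned by the environment this step
curiosity_reward = 0.35   # hypothetical intrinsic reward from the curiosity module

extrinsic_strength = 1.0
curiosity_strength = 0.01

total_reward = extrinsic_strength * extrinsic_reward + curiosity_strength * curiosity_reward
print(total_reward)  # ~1.0035 - the intrinsic term gently encourages exploration
```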

## Enabling Reward Signals

Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. An
example is provided in `config/trainer_config.yaml`. To enable a reward signal, add it to the
`reward_signals:` section under the brain name. For instance, to enable the extrinsic signal
in addition to a small curiosity reward, you would define your `reward_signals` as follows:

```
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    curiosity:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
```

Each reward signal should define at least two parameters, `strength` and `gamma`, in addition
to any class-specific hyperparameters. Note that to remove a reward signal, you should delete
its entry entirely from `reward_signals`. At least one reward signal should be left defined
at all times.
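As a rough illustration (an assumed dictionary layout, not an excerpt from the trainer code), the `reward_signals` section above parses to a nested mapping, and disabling a signal means deleting its entry rather than zeroing its `strength`:

```python
# Illustrative sketch only: the reward_signals section from the YAML example above,
# written as the nested dictionary it parses to.
reward_signals = {
    "extrinsic": {"strength": 1.0, "gamma": 0.99},
    "curiosity": {"strength": 0.01, "gamma": 0.99, "encoding_size": 128},
}

# To disable curiosity, remove its entry entirely.
del reward_signals["curiosity"]

# At least one reward signal should always remain defined.
assert len(reward_signals) >= 1
```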

## Reward Signal Types

### The Extrinsic Reward Signal

The `extrinsic` reward signal is simply the reward given by the
[environment](Learning-Environment-Design.md). Remove it to force the agent
to ignore the environment reward.

#### Strength

`strength` is the factor by which to multiply the raw
reward. Typical ranges will vary depending on the reward signal.

Typical Range: `0.01` - `1.0`

#### Gamma

`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.

Typical Range: `0.8` - `0.995`
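To make the effect of `gamma` concrete, here is a short illustrative Python sketch (not from the ML-Agents codebase) computing the discounted return the agent optimizes; larger values keep distant rewards relevant:

```python
# Illustrative sketch only: discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 0.0, 1.0]            # one reward, three steps in the future
print(discounted_return(rewards, 0.99))   # ~0.970: the distant reward still counts
print(discounted_return(rewards, 0.8))    # ~0.512: the distant reward fades much faster
```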

### The Curiosity Reward Signal

@chriselion

> Review thread on the placeholder line above: "What is this line for?"
> "I'm supposed to write it at some point. @ervteng do you want to leave this empty for now and I'll do it in another PR?"
> "That works - let me add a one-liner for this PR so it isn't completely empty in this PR."
> "I would like this removed / addressed before merge."

#### Strength

In this case, `strength` corresponds to the magnitude of the curiosity reward generated
by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough
to not be overwhelmed by extrinsic reward signals in the environment.
Likewise it should not be so large that it overwhelms the extrinsic reward signal.

Typical Range: `0.001` - `0.1`

#### Gamma

`gamma` corresponds to the discount factor for future rewards.

Typical Range: `0.8` - `0.9`

> Review thread: "Is the gamma this small really?"
> "This was from GAIL - I think it should be 0.995 actually."

#### Encoding Size

`encoding_size` corresponds to the size of the encoding used by the intrinsic curiosity model.
This value should be small enough to encourage the ICM to compress the original
observation, but also not too small to prevent it from learning the dynamics of
the environment.

Default Value: `64`
Typical Range: `64` - `256`

#### Learning Rate

`learning_rate` is the learning rate used to update the intrinsic curiosity module.
This should typically be decreased if training is unstable, and the curiosity loss is unstable.

Default Value: `3e-4`
Typical Range: `1e-5` - `1e-3`
@@ -0,0 +1 @@
from .reward_signal import *
@@ -0,0 +1 @@
from .signal import CuriosityRewardSignal

> Review thread: "If gamma is removed from here, does this mean that old versions of the config will no longer be compatible? Is this going to break people's stuff (I am okay with it) or is there a fallback?"
> "Yeah, the old versions of the config aren't compatible. Having `gamma` won't break anything, but it will end up using the default `gamma` from `default_config`. We could auto-assign the `gamma` here to the extrinsic gamma, but that would break the abstraction. I guess we'll just have to be careful in the migration guide."
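Finally, since the diff above only shows the new package `__init__` files, here is a purely hypothetical sketch of what "refactoring reward signals into a separate class" can look like. None of these names or signatures are taken from the actual ML-Agents code; they only illustrate the shared-interface idea described in the new documentation:

```python
# Hypothetical sketch only - NOT the actual ML-Agents API. Each reward signal becomes its
# own class with a common interface, carrying its own strength and gamma and returning a
# scaled per-step reward.
from abc import ABC, abstractmethod


class RewardSignal(ABC):
    def __init__(self, strength: float, gamma: float):
        self.strength = strength  # multiplier applied to the raw reward
        self.gamma = gamma        # discount factor used for this signal's returns

    @abstractmethod
    def evaluate(self, step_data: dict) -> float:
        """Return the scaled reward for one step of experience."""


class ExtrinsicRewardSignal(RewardSignal):
    def evaluate(self, step_data: dict) -> float:
        # The extrinsic signal simply passes through the environment reward.
        return self.strength * step_data["environment_reward"]


class CuriosityRewardSignal(RewardSignal):
    def __init__(self, strength: float, gamma: float, encoding_size: int = 64):
        super().__init__(strength, gamma)
        self.encoding_size = encoding_size  # size of the ICM observation encoding

    def evaluate(self, step_data: dict) -> float:
        # A real implementation would derive this from the curiosity module's prediction
        # error; a precomputed value is read here to keep the sketch self-contained.
        return self.strength * step_data["curiosity_reward"]


# The agent is trained on the sum of all enabled signals.
step = {"environment_reward": 1.0, "curiosity_reward": 0.35}
signals = [ExtrinsicRewardSignal(1.0, 0.99), CuriosityRewardSignal(0.01, 0.99, 128)]
print(sum(s.evaluate(step) for s in signals))  # ~1.0035
```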