
Refactor reward signals into separate class #2144

Merged
ervteng merged 144 commits into develop from develop-rewardsignalsrefactor on Jul 3, 2019

Conversation

@ervteng (Collaborator) commented Jun 17, 2019

Refactors the reward signals (extrinsic and curiosity) into separate classes that inherit from RewardSignal. The Trainer is now reward-signal agnostic: instead of checking the config for each known signal type, it simply instantiates every reward signal class declared there.

This is in preparation for additional reward signals (e.g. GAIL) as well as reuse across different trainers.

Also, it is equivalent to the IRL PR, minus the new features (GAIL and PreTraining); we're breaking that PR into two to make it easier to review.
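For context, a minimal sketch of the shape this refactor takes. Apart from the RewardSignal class name, the names and fields below are illustrative assumptions, not the PR's exact API:

from collections import namedtuple

# Assumed result container; the PR's actual fields may differ.
RewardSignalResult = namedtuple("RewardSignalResult", ["scaled_reward", "unscaled_reward"])


class RewardSignal:
    """Base class that each reward signal (extrinsic, curiosity, ...) inherits from."""

    def __init__(self, policy, strength, gamma):
        self.policy = policy
        self.strength = strength  # multiplier applied to the raw reward
        self.gamma = gamma        # per-signal discount factor

    def evaluate(self, current_info, next_info):
        """Return a RewardSignalResult for the agents in next_info."""
        raise NotImplementedError


class ExtrinsicRewardSignal(RewardSignal):
    def evaluate(self, current_info, next_info):
        rewards = next_info.rewards  # environment rewards carried by the BrainInfo
        return RewardSignalResult([self.strength * r for r in rewards], rewards)


# The trainer no longer checks for each known signal type by name; it simply
# instantiates whatever the config declares.
SIGNAL_TYPES = {"extrinsic": ExtrinsicRewardSignal}  # curiosity, GAIL, ... registered later


def create_reward_signals(policy, reward_signal_configs):
    return {
        name: SIGNAL_TYPES[name](policy, **settings)
        for name, settings in reward_signal_configs.items()
    }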

@chriselion (Contributor) commented Jun 18, 2019

Overall looks pretty good to me. I'd feel better if someone more familiar with the specific models gave it a once-over though.

@@ -174,11 +190,14 @@ WalkerLearning:
time_horizon: 1000
batch_size: 2048
buffer_size: 20480
gamma: 0.995

@vincentpierre (Collaborator) commented Jun 18, 2019

If gamma is removed from here, does this mean that old versions of the config will no longer be compatible? Is this going to break people's stuff (I am okay with it), or is there a fallback?

@ervteng (Author, Collaborator) commented Jun 19, 2019

Yeah, the old versions of the config aren't compatible. Leaving gamma here won't break anything, but the trainer will end up using the default gamma from default_config. We could auto-assign this gamma to the extrinsic reward's gamma, but that would break the abstraction. I guess we'll just have to be careful in the migration guide.
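For that migration guide, a sketch of where gamma ends up, shown as the dict the trainer would receive once the YAML is loaded; the strength value and exact layout are assumptions based on the reward_signals section this PR introduces:

# Hypothetical WalkerLearning config after migration: gamma now lives under
# the reward signal it belongs to instead of at the trainer level.
walker_config = {
    "time_horizon": 1000,
    "batch_size": 2048,
    "buffer_size": 20480,
    "reward_signals": {
        "extrinsic": {"strength": 1.0, "gamma": 0.995},
    },
}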

@@ -7,6 +7,10 @@ observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).

To train an agent, you will need to provide the agent one or more reward signals which
the agent should attempt to maximize. See [Reward Signals](Training-RewardSignals.md)

@vincentpierre (Collaborator) commented Jun 18, 2019

"the agent will attempt the maximize." I know RL does not work but try to act like it does.

@ervteng (Author, Collaborator) commented Jun 19, 2019

Will replace with "will learn to maximize"

docs/Training-RewardSignals.md — two outdated review threads, resolved

### The Curiosity Reward Signal

@chriselion

@vincentpierre (Collaborator) commented Jun 18, 2019

What is this line for?

@chriselion (Contributor) commented Jun 18, 2019

I'm supposed to write it at some point. @ervteng do you want to leave this empty for now and I'll do it in another PR?

@ervteng (Author, Collaborator) commented Jun 19, 2019

That works. Let me add a one-liner for this PR so the section isn't completely empty.

@vincentpierre (Collaborator) commented Jun 20, 2019

I would like this removed or addressed before the merge.

docs/Training-RewardSignals.md — review thread, resolved
"""
self.loss = 10 * (0.2 * self.forward_loss + 0.8 * self.inverse_loss)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
self.update_batch = optimizer.minimize(self.loss)

@vincentpierre (Collaborator) commented Jun 18, 2019

New line?

@ervteng (Author, Collaborator) commented Jun 19, 2019

Hey @vincentpierre, where would you like the new line?

@vincentpierre (Collaborator) commented Jun 19, 2019

Ah, sorry, I was not clear. I think there needs to be an empty line at the end of the document. It is possible that the line is there and just not appearing on GitHub; in that case, ignore this comment.

@ervteng (Author, Collaborator) commented Jun 19, 2019

Black should catch this in CircleCI, but let me double-check. Sometimes VSCode does weird things with Black.

self.policy = policy
self.strength = strength

def evaluate(self, current_info, next_info):

@vincentpierre (Collaborator) commented Jun 18, 2019

Can you specify that the return type is RewardSignalResult, for clarity?
Also, I am wondering whether the scaling of the reward should be handled by the reward signal or by the trainer.

@ervteng (Author, Collaborator) commented Jun 19, 2019

Added the RewardSignalResult return type to the comment.

Hmm, we could go either way.
Pros of the current way: the Trainer doesn't have to be aware of the strength, and "strength" could be defined differently for different reward signals.
Pros of doing it in the Trainer: it's more generic, and it would be much easier to add reward normalization in the future if we want to.
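To make the trade-off concrete, a rough sketch of both options, reusing the RewardSignal and RewardSignalResult sketched in the description above; the helper name is an assumption:

# Option A (this PR): each signal applies its own strength before returning.
class StrengthAwareSignal(RewardSignal):
    def evaluate(self, current_info, next_info):
        unscaled = self.compute_raw_reward(current_info, next_info)  # hypothetical helper
        return RewardSignalResult([self.strength * r for r in unscaled], unscaled)


# Option B: the trainer applies strength, keeping the signals strength-agnostic,
# which would make global reward normalization easier to add later.
def combine_rewards(reward_signals, current_info, next_info):
    totals = None
    for signal in reward_signals.values():
        result = signal.evaluate(current_info, next_info)
        scaled = [signal.strength * r for r in result.unscaled_reward]
        totals = scaled if totals is None else [t + s for t, s in zip(totals, scaled)]
    return totals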

return {}

@classmethod
def check_config(cls, config_dict, param_keys=None):

@vincentpierre (Collaborator) commented Jun 18, 2019

Should this be static? Also, I think this is not only used in reward signals.

@ervteng (Author, Collaborator) commented Jun 19, 2019

It's not static so I can get the cls name.

But yeah, there is similar (though slightly different) logic in the Trainer, and it's weird to call a Trainer function from RewardSignals. I wonder if it's time to make a utils.py file with all of these common functions? Some common logic for minibatching and sequence computation could also go in that file.
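A sketch of why the classmethod form is convenient here; the key-checking logic is an assumption, and the exception class is a stand-in so the sketch is self-contained:

class UnityTrainerException(Exception):
    """Stand-in for the project's exception."""


class RewardSignal:
    @classmethod
    def check_config(cls, config_dict, param_keys=None):
        """Raise if a required hyperparameter is missing from this signal's config."""
        for key in param_keys or []:
            if key not in config_dict:
                # cls.__name__ reports the concrete subclass (e.g. CuriosityRewardSignal),
                # which is the reason for a classmethod rather than a staticmethod.
                raise UnityTrainerException(
                    "The hyperparameter {0} could not be found for {1}.".format(
                        key, cls.__name__
                    )
                )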

# Make sure we have at least one reward_signal
if not self.trainer_parameters["reward_signals"]:
raise UnityTrainerException(
"No reward signals were defined. At least one must be used with the PPO trainer."

@vincentpierre (Collaborator) commented Jun 18, 2019

Use the class name instead of hard-coding PPO here?

@ervteng (Author, Collaborator) commented Jun 19, 2019

Fixed.

BTW, I plan to make this even more generic; e.g., for printing the trainer_parameters, I'd move it to the base Trainer class and use the class name from there. That will probably come when we add SAC; with this PR I'm trying not to modify too much.
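Sketched against the snippet above, inside the trainer's initializer, the generic version could pull the name from the class itself; the surrounding attribute names are whatever the trainer already has:

# Using the class name instead of hard-coding "PPO" in the error message.
if not self.trainer_parameters["reward_signals"]:
    raise UnityTrainerException(
        "No reward signals were defined. At least one must be used with the "
        "{} trainer.".format(self.__class__.__name__)
    )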

ervteng added 4 commits Jun 19, 2019
@@ -64,7 +65,7 @@ Typical Range: `0.8` - `0.995`

### The Curiosity Reward Signal

@chriselion
The `curiosity` Reward Signal enables the Intrinsic Curiosity Module.

@vincentpierre (Collaborator) commented Jun 20, 2019

Link to the paper?

ervteng and others added 6 commits Jun 20, 2019
@CLAassistant commented Jul 1, 2019

CLA assistant check
All committers have signed the CLA.

which correspond to the agents in a provided next_info.
:BrainInfo next_info: A t+1 BrainInfo.
:return: curr_info: Reconstructed BrainInfo to match agents of next_info.
"""
visual_observations = [[]]
visual_observations = []

@chriselion (Contributor) commented Jul 1, 2019

Is this correct? We do visual_observations[i].append(...) below, but never resize this directly. (I think it may have been broken before for >1 visual observations too)

@ervteng (Author, Collaborator) commented Jul 3, 2019

This was changed in the last commit (the one that added types); just marking it here for reference.
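For reference, one way the reconstruction can stay correct for any number of visual observations; this is a sketch only, the actual fix in the typed commit may differ, and the index mapping is hypothetical:

# One sub-list per visual observation stream, so visual_observations[i].append(...)
# stays valid when an agent has more than one camera.
visual_observations = [[] for _ in next_info.visual_observations]
for i, obs_batch in enumerate(current_info.visual_observations):
    for agent_index in matched_indices:  # hypothetical mapping of next_info agents into current_info
        visual_observations[i].append(obs_batch[agent_index])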

@ervteng merged commit f4bc8ef into develop on Jul 3, 2019

2 checks passed:
ci/circleci: build — Your tests passed on CircleCI!
license/cla — Contributor License Agreement is signed.

@chriselion chriselion deleted the develop-rewardsignalsrefactor branch Jul 11, 2019
