GAIL and Pretraining #2118
Conversation
# Conflicts:
#	ml-agents/mlagents/trainers/models.py
ml-agents/mlagents/trainers/components/reward_signals/gail/model.py (resolved)
ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py (resolved)
docs/Training-PPO.md (outdated diff excerpt)

Typical Range: `3` - `10`

### (Optional) Max Batches
This one might be a little confusing to people.
I agree, but I also think it's necessary for people who have a huge demonstration dataset. We could do a couple of things:
- Remove the option and just set it to the buffer size given by PPO, perhaps allowing an override
- Change the option to Samples Per Update or Demonstration Buffer Size
- Leave as-is
I agree that it is useful to have. I think it just needs a different name that is a little more descriptive. "Samples Per Update" could fit the bill.
Done! (and yes, any issues were related to the stochasticity)
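For context, a hypothetical sketch of how the renamed hyperparameter might appear in a trainer configuration, written here as a Python dict; the keys and values are illustrative assumptions, not the exact schema in this PR:

```python
# Hypothetical config excerpt; keys and values are illustrative only.
pretraining_config = {
    "demo_path": "./demos/ExpertPyramid.demo",  # path to recorded demonstrations
    "strength": 0.5,                            # weight of the pretraining loss
    "samples_per_update": 0,                    # renamed from "Max Batches"
}
```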
):
    """
    The initializer for the GAIL reward generator.
    https://arxiv.org/abs/1606.03476
Do we want to reference VAIL as well? Since we are using it, it would probably help people reading our code to know why we are doing something so different from GAIL. Also, have we shown that VAIL does indeed outperform GAIL?
Made VAIL an option. For simple environments like Pyramids and Crawler, VAIL actually learns more slowly. I believe it makes learning more stable, but given the learning speed on Pyramids and Crawler alone, I've decided to default to `use_vail = False`.
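For readers unfamiliar with VAIL, a minimal sketch of the variational discriminator bottleneck it adds (https://arxiv.org/abs/1810.00821), assuming TF 1.x like the surrounding code; all identifiers and sizes below are illustrative, not the ones used in this PR:

```python
import tensorflow as tf

features = tf.placeholder(tf.float32, [None, 64])  # stand-in discriminator input
z_size = 32
i_c = 0.5  # target for the mutual-information constraint I_c

hidden = tf.layers.dense(features, 128, activation=tf.nn.elu)
z_mean = tf.layers.dense(hidden, z_size)
z_log_sigma_sq = tf.layers.dense(hidden, z_size)
# Reparameterization trick: sample the bottleneck code z.
z = z_mean + tf.exp(0.5 * z_log_sigma_sq) * tf.random_normal(tf.shape(z_mean))

# KL( N(z_mean, sigma^2) || N(0, I) ), averaged over the batch.
kl_divergence = 0.5 * tf.reduce_mean(
    tf.reduce_sum(
        tf.square(z_mean) + tf.exp(z_log_sigma_sq) - 1.0 - z_log_sigma_sq,
        axis=1,
    )
)
# beta is adapted by dual gradient ascent so kl_divergence stays near i_c;
# the discriminator loss then becomes gail_loss + beta * (kl_divergence - i_c).
```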
hidden_1 = tf.layers.dense(
    concat_input,
    self.h_size,
    activation=tf.nn.elu,
Do we want to use swish for these activation functions as well (for consistency)?
Yes. Related note: should we call it `LearningModel.activation` rather than `LearningModel.swish`, so that if we change it in the future we can just change one spot?
I think that makes sense.
Will leave that for a future PR; for now, changed to swish.
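For reference, a minimal sketch of the swish activation being discussed, as it is commonly defined (x times sigmoid(x)); the placement on `LearningModel` follows the naming discussion above, and the exact body here is an assumption:

```python
import tensorflow as tf

class LearningModel:
    @staticmethod
    def swish(input_activation: tf.Tensor) -> tf.Tensor:
        """Swish activation: x * sigmoid(x)."""
        return tf.multiply(input_activation, tf.nn.sigmoid(input_activation))
```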
# for reporting
kl_loss = []
pos = []
Maybe we want to give these variables more semantically interpretable names?
I updated these. TBH, we might be able to remove them entirely, since they're only there for printing during debugging.
In that case I think it makes sense to remove them.
def reward_signal_update(env, policy, reward_signal_name):
    brain_info_list = []
    for i in range(20):
There are a few numerical values hard-coded here. Can we replace them with vars/consts?
🚢 🇮🇹
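A hypothetical sketch of what hoisting the magic number into a named constant might look like; the constant name is illustrative:

```python
NUM_ENV_STEPS = 20  # was hard-coded inline as range(20)

def reward_signal_update(env, policy, reward_signal_name):
    brain_info_list = []
    for _ in range(NUM_ENV_STEPS):
        ...  # collect BrainInfo from env steps as before
```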
policy: TFPolicy,
strength: float,
gamma: float,
demo_path: str,
(discussed offline) I think if we could make this take a Buffer instance instead of the path, it would be a little cleaner. But it might not be easy to pass it down from the TrainerController.
One way we could do this is to add a `demo_path` config at the highest level (the PPO trainer). If it exists, the TrainerController will load the demo, produce a demo Buffer, and pass it to the Trainer; otherwise it sets it to None. The PreTraining and GAIL modules will then attempt to grab the buffer from the Trainer and throw an exception if it is None.
But this removes the ability to set a different demo file for each module separately, and it makes things a bit less compartmentalized than before. I'm not sure whether it's overall cleaner at the end of the day.
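A sketch of the alternative described above. `demo_to_buffer` does live in `mlagents.trainers.demo_loader`; the helper function itself and its wiring into the TrainerController are hypothetical:

```python
from mlagents.trainers.demo_loader import demo_to_buffer

def load_demo_buffer(trainer_params, sequence_length):
    """Hypothetical TrainerController-side helper: load the demo once,
    returning None when no demo_path is configured."""
    demo_path = trainer_params.get("demo_path")
    if demo_path is None:
        return None
    _, demo_buffer = demo_to_buffer(demo_path, sequence_length)
    return demo_buffer
```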
ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py (resolved)
* develop: (69 commits)
  Add different types of visual encoder (nature cnn/resnet)
  Make SubprocessEnvManager take asynchronous steps (#2265)
  update mypy version
  one more unused
  remove unused variables
  Fix respawn part of BananaLogic (#2277)
  fix whitespace and line breaks
  remove codacy (#2287)
  Ported documentation from other branch
  tennis reset parameter implementation ported over
  Fixed the default value to match the value in the docs
  two soccer reset parameter implementation ported over
  3D ball reset parameter implementation ported over
  3D ball reset parameter implementation ported over
  Relax the cloudpickle version restriction (#2279)
  Fix get_value_estimate and buffer append (#2276)
  fix lint checks
  Add Unity command line arguments
  Swap 0 set and reward buffer append (#2273)
  GAIL and Pretraining (#2118)
  ...
Based on the new reward signals architecture, add BC pretrainer and GAIL for PPO. Main changes: