
GAIL and Pretraining #2118

Merged: 146 commits into develop on Jul 16, 2019
Conversation

@ervteng ervteng (Contributor) commented Jun 10, 2019

Based on the new reward signals architecture, this adds a BC pretrainer and GAIL for PPO. Main changes:

  • A new GAILRewardSignal and GAILModel for GAIL (a conceptual sketch of the reward follows this list)
  • A BCModule component (not a reward signal) to do pretraining during RL
  • Documentation for both of these
  • A change to the Demo Loader that lets you load multiple demo files from a folder
  • Example Demo files for all of our tested sample environments (for future regression testing)
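
For context, the core of a GAIL reward signal is that a learned discriminator scores how expert-like the policy's transitions look, and that score is turned into an intrinsic reward. A minimal conceptual sketch of one common formulation (illustrative only, not this PR's implementation):

import numpy as np

def gail_reward(expert_probability, eps=1e-7):
    # expert_probability: the discriminator's estimate that a (state, action)
    # pair came from the demonstrations rather than from the current policy.
    # Transitions that look more expert-like receive a larger reward.
    return -np.log(1.0 - expert_probability + eps)

Like the other reward signals, this intrinsic reward is then scaled by the signal's configured strength and combined with any extrinsic reward.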

docs/Training-PPO.md (outdated)

Typical Range: `3` - `10`

### (Optional) Max Batches

Contributor:
This one might be a little confusing to people.

Contributor Author:
I agree, but I also think it's necessary for people who have a huge demonstration dataset. We could do a couple of things:

  • Remove the option and just set it to the buffer size given by PPO, perhaps allowing an override
  • Change the option to Samples Per Update or Demonstration Buffer Size
  • Leave it as-is

Contributor:
I agree that it is useful to have. I think it just needs a different name that is a little more descriptive. "Samples Per Update" could fit the bill.

Contributor Author:
Done! (and yes any issues were related to the stochasticity)
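
For readers skimming the thread, a rough illustration of what a cap like Samples Per Update would control during pretraining (hypothetical names, not the PR's API): it bounds how many demonstration samples each update consumes, so a very large demo set does not dominate update time.

def num_pretraining_batches(demo_buffer_length, batch_size, samples_per_update):
    # Without a cap, every demonstration sample is visited on every update.
    n_batches = demo_buffer_length // batch_size
    if samples_per_update > 0:
        # With a cap, at most samples_per_update samples are drawn per update.
        n_batches = min(n_batches, max(1, samples_per_update // batch_size))
    return n_batches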

):
    """
    The initializer for the GAIL reward generator.
    https://arxiv.org/abs/1606.03476

Contributor:
Do we want to reference VAIL as well? Since we are using it, it would probably help people reading our code to know why we are doing something so different from GAIL. Also, have we shown that VAIL does indeed outperform GAIL?

Contributor Author:
Made VAIL an option. For simpler environments like Pyramids and Crawler, VAIL actually learns slower. I believe it makes learning more stable, but given the slower learning on those environments, I've decided to default to use_vail = False.
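
For readers wondering why the code diverges from plain GAIL when use_vail = True: VAIL adds a variational bottleneck to the discriminator, sampling a latent code and penalizing its KL divergence from a unit Gaussian so the discriminator cannot become arbitrarily sharp. A minimal TensorFlow 1.x sketch with illustrative names (not the PR's actual GAILModel code):

import tensorflow as tf

def discriminator_bottleneck(hidden, z_size, beta):
    # Project the discriminator's hidden state to the parameters of a
    # Gaussian latent code, then sample it with the reparameterization trick.
    z_mean = tf.layers.dense(hidden, z_size, name="z_mean")
    z_log_sigma = tf.layers.dense(hidden, z_size, name="z_log_sigma")
    z = z_mean + tf.exp(z_log_sigma) * tf.random_normal(tf.shape(z_mean))
    # KL(q(z|x) || N(0, I)) penalty that limits how much information the
    # discriminator can extract, which is what stabilizes training in VAIL.
    kl_loss = 0.5 * tf.reduce_mean(
        tf.reduce_sum(
            tf.square(z_mean) + tf.exp(2.0 * z_log_sigma) - 2.0 * z_log_sigma - 1.0,
            axis=1,
        )
    )
    return z, beta * kl_loss

In the full method the KL weight is typically adapted toward a target divergence, which is the stabilizing effect mentioned above.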

hidden_1 = tf.layers.dense(
    concat_input,
    self.h_size,
    activation=tf.nn.elu,

Contributor:
Do we want to use swish for these activation functions as well (for consistency)?

Contributor Author:
Yes. Related note: should we call it LearningModel.activation rather than LearningModel.swish so that if we change it in the future we can just change one spot?

Contributor:
I think that makes sense.

Contributor Author:
Will leave that for a future PR; for now, changed to swish.
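
For reference, swish is just the input multiplied by its own sigmoid; a minimal standalone sketch (the PR reuses the existing LearningModel.swish helper rather than redefining it):

import tensorflow as tf

def swish(input_activation):
    # Swish: x * sigmoid(x), a smooth alternative to ELU/ReLU.
    return tf.multiply(input_activation, tf.nn.sigmoid(input_activation))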


# for reporting
kl_loss = []
pos = []

Contributor:
Maybe we want to give these variables more semantically interpretable names?

Contributor Author:
I updated these. TBH, we might be able to remove them entirely since they're only there for printing during debugging.

Contributor:
In that case I think it makes sense to remove them.


def reward_signal_update(env, policy, reward_signal_name):
    brain_info_list = []
    for i in range(20):

Contributor:
There are a few numerical values hard-coded here. Can we replace them with vars/consts?
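
A small sketch of the suggested cleanup (constant names here are hypothetical): hoist the bare literals into named module-level constants so the test reads as intent.

# Hypothetical constant name for the hard-coded loop count above.
NUM_STEPS_TO_COLLECT = 20

def reward_signal_update(env, policy, reward_signal_name):
    brain_info_list = []
    for _ in range(NUM_STEPS_TO_COLLECT):
        ...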

@awjuliani awjuliani (Contributor) left a comment
🚢 🇮🇹

policy: TFPolicy,
strength: float,
gamma: float,
demo_path: str,

Contributor:
(Discussed offline) I think if we could make this take a Buffer instance instead of the path, it would be a little cleaner. But it might not be easy to pass it down from the TrainerController.

Contributor Author:
One way we could do this is to add a demo_path config at the highest level (the PPO trainer). If it exists, the TrainerController will load the demo, produce a demo Buffer, and pass it to the Trainer; otherwise it is set to None. The PreTraining and GAIL modules would then grab the buffer from the Trainer and throw an exception if it is None.

But this removes the ability to set a different demo file for each module separately and makes it a bit less compartmentalized than before. I'm not sure it's overall cleaner at the end of the day.
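
A rough sketch of the flow described above, purely illustrative (none of these names are the PR's actual API): the TrainerController loads the demonstrations once into a Buffer and hands it to the Trainer, and GAIL or pretraining fails fast if no buffer was provided.

def build_trainer(trainer_cls, trainer_params, load_demo_buffer):
    # load_demo_buffer is a hypothetical stand-in for the demo loading utility.
    demo_buffer = None
    if "demo_path" in trainer_params:
        demo_buffer = load_demo_buffer(trainer_params["demo_path"])
    return trainer_cls(trainer_params, demo_buffer=demo_buffer)

def require_demo_buffer(trainer, component_name):
    # Called by the GAIL / pretraining components when they initialize.
    if trainer.demo_buffer is None:
        raise ValueError(
            f"{component_name} requires demonstrations: set demo_path on the trainer."
        )
    return trainer.demo_buffer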

@ervteng ervteng merged commit 34e852c into develop Jul 16, 2019
mantasp added a commit that referenced this pull request Jul 22, 2019
* develop: (69 commits)
  Add different types of visual encoder (nature cnn/resnet)
  Make SubprocessEnvManager take asynchronous steps (#2265)
  update mypy version
  one more unused
  remove unused variables
  Fix respawn part of BananaLogic (#2277)
  fix whitespace and line breaks
  remove codacy (#2287)
  Ported documentation from other branch
  tennis reset parameter implementation ported over
  Fixed the default value to match the value in the docs
  two soccer reset parameter implementation ported over
  3D ball reset parameter implementation ported over
  3D ball reset parameter implementation ported over
  Relax the cloudpickle version restriction (#2279)
  Fix get_value_estimate and buffer append (#2276)
  fix lint checks
  Add Unity command line arguments
  Swap 0 set and reward buffer append (#2273)
  GAIL and Pretraining (#2118)
  ...
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 18, 2021