
GAIL and Pretraining #2118

Merged: 146 commits into develop on Jul 16, 2019
Conversation

@ervteng ervteng (Contributor) commented Jun 10, 2019

Based on the new reward signals architecture, this adds a BC pretrainer and GAIL for PPO. Main changes:

  • A new GAILRewardSignal and GAILModel for GAIL (a conceptual sketch of the reward follows this list)
  • A BCModule component (not a reward signal) to do pretraining during RL
  • Documentation for both of these
  • A change to the Demo Loader that lets you load multiple demo files from a folder
  • Example Demo files for all of our tested sample environments (for future regression testing)
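
For context, the core of a GAIL reward signal is that a learned discriminator scores how expert-like the policy's transitions look, and that score is turned into an intrinsic reward. A minimal conceptual sketch of one common formulation (illustrative only, not this PR's implementation):

import numpy as np

def gail_reward(expert_probability, eps=1e-7):
    # expert_probability: the discriminator's estimate that a (state, action)
    # pair came from the demonstrations rather than from the current policy.
    # Transitions that look more expert-like receive a larger reward.
    return -np.log(1.0 - expert_probability + eps)

Like the other reward signals, this intrinsic reward is then scaled by the signal's configured strength and combined with any extrinsic reward.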

docs/Training-PPO.md (outdated)

Typical Range: `3` - `10`

### (Optional) Max Batches

Contributor:
This one might be a little confusing to people.

Contributor Author:
I agree, but I also think it's necessary for people who have a huge demonstration dataset. We could do a couple of things:

  • Remove the option and just set it to the buffer size given by PPO, perhaps allowing an override
  • Change the option to Samples Per Update or Demonstration Buffer Size
  • Leave it as-is

Contributor:
I agree that it is useful to have. I think it just needs a different name that is a little more descriptive. "Samples Per Update" could fit the bill.

Contributor Author:
Done! (and yes any issues were related to the stochasticity)
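
For readers skimming the thread, a rough illustration of what a cap like Samples Per Update would control during pretraining (hypothetical names, not the PR's API): it bounds how many demonstration samples each update consumes, so a very large demo set does not dominate update time.

def num_pretraining_batches(demo_buffer_length, batch_size, samples_per_update):
    # Without a cap, every demonstration sample is visited on every update.
    n_batches = demo_buffer_length // batch_size
    if samples_per_update > 0:
        # With a cap, at most samples_per_update samples are drawn per update.
        n_batches = min(n_batches, max(1, samples_per_update // batch_size))
    return n_batches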

):
    """
    The initializer for the GAIL reward generator.
    https://arxiv.org/abs/1606.03476

Contributor:
Do we want to reference VAIL as well? Since we are using it, it would probably help people reading our code to know why we are doing something so different from GAIL. Also, have we shown that VAIL does indeed outperform GAIL?

Contributor Author:
Made VAIL an option. For simpler environments like Pyramids and Crawler, VAIL actually learns slower. I believe it makes learning more stable, but given the slower learning on those environments, I've decided to default to use_vail = False.
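
For readers wondering why the code diverges from plain GAIL when use_vail = True: VAIL adds a variational bottleneck to the discriminator, sampling a latent code and penalizing its KL divergence from a unit Gaussian so the discriminator cannot become arbitrarily sharp. A minimal TensorFlow 1.x sketch with illustrative names (not the PR's actual GAILModel code):

import tensorflow as tf

def discriminator_bottleneck(hidden, z_size, beta):
    # Project the discriminator's hidden state to the parameters of a
    # Gaussian latent code, then sample it with the reparameterization trick.
    z_mean = tf.layers.dense(hidden, z_size, name="z_mean")
    z_log_sigma = tf.layers.dense(hidden, z_size, name="z_log_sigma")
    z = z_mean + tf.exp(z_log_sigma) * tf.random_normal(tf.shape(z_mean))
    # KL(q(z|x) || N(0, I)) penalty that limits how much information the
    # discriminator can extract, which is what stabilizes training in VAIL.
    kl_loss = 0.5 * tf.reduce_mean(
        tf.reduce_sum(
            tf.square(z_mean) + tf.exp(2.0 * z_log_sigma) - 2.0 * z_log_sigma - 1.0,
            axis=1,
        )
    )
    return z, beta * kl_loss

In the full method the KL weight is typically adapted toward a target divergence, which is the stabilizing effect mentioned above.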

hidden_1 = tf.layers.dense(
    concat_input,
    self.h_size,
    activation=tf.nn.elu,

Contributor:
Do we want to use swish for these activation functions as well (for consistency)?

Contributor Author:
Yes. Related note: should we call it LearningModel.activation rather than LearningModel.swish so that if we change it in the future we can just change one spot?

Contributor:
I think that makes sense.

Contributor Author:
Will leave that for a future PR; for now, changed to swish.
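
For reference, swish is just the input multiplied by its own sigmoid; a minimal standalone sketch (the PR reuses the existing LearningModel.swish helper rather than redefining it):

import tensorflow as tf

def swish(input_activation):
    # Swish: x * sigmoid(x), a smooth alternative to ELU/ReLU.
    return tf.multiply(input_activation, tf.nn.sigmoid(input_activation))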


# for reporting
kl_loss = []
pos = []

Contributor:
Maybe we want to give these variables more semantically interpretable names?

Contributor Author:
I updated these. TBH, we might be able to remove them entirely since they're only there for printing during debugging.

Contributor:
In that case I think it makes sense to remove them.


def reward_signal_update(env, policy, reward_signal_name):
    brain_info_list = []
    for i in range(20):

Contributor:
There are a few numerical values hard-coded here. Can we replace them with vars/consts?
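
A small sketch of the suggested cleanup (constant names here are hypothetical): hoist the bare literals into named module-level constants so the test reads as intent.

# Hypothetical constant name for the hard-coded loop count above.
NUM_STEPS_TO_COLLECT = 20

def reward_signal_update(env, policy, reward_signal_name):
    brain_info_list = []
    for _ in range(NUM_STEPS_TO_COLLECT):
        ...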

@awjuliani awjuliani (Contributor) left a comment
🚢 🇮🇹

policy: TFPolicy,
strength: float,
gamma: float,
demo_path: str,

Contributor:
(Discussed offline) I think if we could make this take a Buffer instance instead of the path, it would be a little cleaner. But it might not be easy to pass it down from the TrainerController.

Contributor Author:
One way we could do this is to add a demo_path config at the highest level (the PPO trainer). If it exists, the TrainerController will load the demo, produce a demo Buffer, and pass it to the Trainer; otherwise it is set to None. The PreTraining and GAIL modules would then grab the buffer from the Trainer and throw an exception if it is None.

But this removes the ability to set a different demo file for each module separately and makes it a bit less compartmentalized than before. I'm not sure it's overall cleaner at the end of the day.
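
A rough sketch of the flow described above, purely illustrative (none of these names are the PR's actual API): the TrainerController loads the demonstrations once into a Buffer and hands it to the Trainer, and GAIL or pretraining fails fast if no buffer was provided.

def build_trainer(trainer_cls, trainer_params, load_demo_buffer):
    # load_demo_buffer is a hypothetical stand-in for the demo loading utility.
    demo_buffer = None
    if "demo_path" in trainer_params:
        demo_buffer = load_demo_buffer(trainer_params["demo_path"])
    return trainer_cls(trainer_params, demo_buffer=demo_buffer)

def require_demo_buffer(trainer, component_name):
    # Called by the GAIL / pretraining components when they initialize.
    if trainer.demo_buffer is None:
        raise ValueError(
            f"{component_name} requires demonstrations: set demo_path on the trainer."
        )
    return trainer.demo_buffer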

@ervteng ervteng merged commit 34e852c into develop Jul 16, 2019
mantasp added a commit that referenced this pull request Jul 22, 2019
* develop: (69 commits)
  Add different types of visual encoder (nature cnn/resnet)
  Make SubprocessEnvManager take asynchronous steps (#2265)
  update mypy version
  one more unused
  remove unused variables
  Fix respawn part of BananaLogic (#2277)
  fix whitespace and line breaks
  remove codacy (#2287)
  Ported documentation from other branch
  tennis reset parameter implementation ported over
  Fixed the default value to match the value in the docs
  two soccer reset parameter implementation ported over
  3D ball reset parameter implementation ported over
  3D ball reset parameter implementation ported over
  Relax the cloudpickle version restriction (#2279)
  Fix get_value_estimate and buffer append (#2276)
  fix lint checks
  Add Unity command line arguments
  Swap 0 set and reward buffer append (#2273)
  GAIL and Pretraining (#2118)
  ...
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 18, 2021