GAIL and Pretraining #2118

Merged on Jul 16, 2019 (146 commits). Changes shown are from 139 commits.

Commits:
eb4abf2
New version of GAIL
awjuliani Oct 9, 2018
d0852ac
Move Curiosity to separate class
awjuliani Oct 12, 2018
4b15b80
Curiosity fully working under new system
awjuliani Oct 12, 2018
ad9381b
Begin implementing GAIL
awjuliani Oct 12, 2018
8bf8302
fix discrete curiosity
vincentpierre Oct 12, 2018
d3e244e
Add expert demonstration
awjuliani Oct 13, 2018
a5b95f7
Remove notebook
awjuliani Oct 13, 2018
dc2fcaa
Record intrinsic rewards properly
awjuliani Oct 13, 2018
49cff40
Add gail model updating
awjuliani Oct 13, 2018
48d3769
Code cleanup
awjuliani Oct 15, 2018
6eeb565
Nested structure for intrinsic rewards
awjuliani Oct 15, 2018
8ca7728
Rename files
awjuliani Oct 15, 2018
226b5c7
Update models so files
awjuliani Oct 15, 2018
3386aa7
fix typo
awjuliani Oct 15, 2018
6799756
Add reward strength parameter
awjuliani Oct 15, 2018
468c407
Use dictionary of reward signals
awjuliani Oct 17, 2018
519e2d3
Remove reward manager
awjuliani Oct 17, 2018
7df1a69
Extrinsic reward just another type
awjuliani Oct 17, 2018
99237cd
Clean up imports
awjuliani Oct 17, 2018
9fa51c1
All reward signals use strength to scale output
awjuliani Oct 17, 2018
7f24677
produce scaled and unscaled reward
awjuliani Oct 18, 2018
4a714d0
Remove unused dictionary
awjuliani Oct 18, 2018
3e2671d
Current trainer config
awjuliani Oct 18, 2018
77211d8
Add discrete control and pyramid experimentation
awjuliani Oct 19, 2018
2334de8
Minor changes to GAIL
awjuliani Oct 20, 2018
439387e
Add relevant strength parameters
awjuliani Oct 21, 2018
ba793a3
Replace string
awjuliani Oct 21, 2018
a52ba0b
Add support for visual observations w/ GAIL
awjuliani Oct 31, 2018
5b2ef22
Finish implementing visual obs for GAIL
awjuliani Nov 1, 2018
13542b4
Include demo files
awjuliani Nov 1, 2018
ae7a8b0
Fix for RNN w/ GAIL
awjuliani Nov 1, 2018
bf89082
Keep track of reward streams separately
awjuliani Nov 2, 2018
360482b
Bootstrap value estimates separately
awjuliani Nov 2, 2018
c78639d
Add value head
awjuliani Nov 14, 2018
3b2485d
Use sepaprate value streams for each reward
awjuliani Nov 15, 2018
40bc9ba
Add VAIL
awjuliani Nov 15, 2018
c6e1504
Use adaptive B
awjuliani Nov 16, 2018
60d9ff7
Comments improvements
vincentpierre Jan 10, 2019
49ec682
Added comments and refactored a pievce of the code
vincentpierre Jan 10, 2019
d9847e0
Added Comments
vincentpierre Jan 10, 2019
dc7620b
Fix on Curriosity
vincentpierre Jan 11, 2019
28e0bd5
Fixed typo
vincentpierre Jan 11, 2019
0257d2b
Added a forgotten comment
vincentpierre Jan 11, 2019
fd55c00
Stabilized Vail learning. Still no learning for Walker
vincentpierre Jan 14, 2019
2343b3f
Fixing typo on curiosity when using visual input
vincentpierre Jan 17, 2019
c74ad19
Added some comments
vincentpierre Jan 17, 2019
2dd7c61
modified the hyperparameters
vincentpierre Jan 17, 2019
42429a5
Fixed some of the tests, will need to refactor the reward signals in …
vincentpierre Jan 19, 2019
ec0e106
Putting the has_updated fags inside each reward signal
vincentpierre Jan 22, 2019
6ae1c2f
Added comments for the GAIL update method
vincentpierre Jan 22, 2019
ef65bc2
initial commit
vincentpierre Jan 24, 2019
8cbdbf4
No more normalization after pre-training
vincentpierre Jan 24, 2019
3f35d45
Fixed large bug in Vail
vincentpierre Jan 30, 2019
3be9be7
BUG FIX VAIL : The noise dimension was wrong and the discriminator sc…
vincentpierre Feb 1, 2019
9e9b4ff
implemented discrete control pretraining
vincentpierre Feb 2, 2019
d537a6b
bug fixing
vincentpierre Feb 3, 2019
713263c
Bug fix, still not tested for recurrent
vincentpierre Feb 6, 2019
ca5b948
Fixing beta in GAIL so it will change properly
vincentpierre Mar 6, 2019
671629e
Allow for not specifying an extrinsic reward
Apr 19, 2019
a31c8a5
Rough implementation of annealed BC
Apr 24, 2019
93cb4ff
Fixes for rebase onto v0.8
Apr 24, 2019
6534291
Moved BC trainer out of reward_signals and code cleanup
Apr 25, 2019
700b478
Rename folder to "components"
Apr 25, 2019
71eedf5
Fix renaming in Curiosity
Apr 25, 2019
83b4603
Remove demo_aided as a required param
May 2, 2019
9e4b4e2
Make old BC compatible
May 2, 2019
f814432
Fix visual obs for curiosity
May 3, 2019
e10194f
Tweaks all around
May 9, 2019
fdcfb30
Add reward normalization and bug fix
May 9, 2019
cb5e927
Load multiple .demo files. Fix bug with csv nans
May 30, 2019
2c5c853
Remove reward normalization
May 30, 2019
e66a343
Rename demo_aided to pretraining
May 30, 2019
0a98289
Fix bc configs
May 30, 2019
cd6e498
Increase small val to prevent NaNs
May 30, 2019
d23f6f3
Fix init in components
May 31, 2019
d93e36e
Merge remote-tracking branch 'origin/develop' into develop-irl-ervin
May 31, 2019
1bf68c7
Fix PPO tests
May 31, 2019
9da6e6c
Refactor components into common location
May 31, 2019
4a57a32
Minor code cleanup
Jun 3, 2019
11cc6f9
Preliminary RNN support
Jun 5, 2019
e66a6f7
Revert regression with NaNs for LSTMs
Jun 6, 2019
bea2bc7
Better LSTM support for BC
Jun 6, 2019
6302a55
Code cleanup and black reformat
Jun 6, 2019
d1cded9
Remove demo_helper and reformat signal
Jun 6, 2019
2b98f3b
Tests for GAIL and curiosity
Jun 6, 2019
440146b
Fix Black again...
Jun 6, 2019
98f9160
Tests for BCModule and visual tests for RewardSignals
Jun 6, 2019
5c923cb
Refactor to new structure and use class generator
Jun 7, 2019
e7ce888
Generalize reward_signal interface and stats
Jun 8, 2019
858194f
Fix incorrect environment reward reporting
Jun 10, 2019
28bceba
Rename reward signals for consistency. clean up comments
Jun 10, 2019
248cae4
Default trainer config (for cloud testing)
Jun 10, 2019
744df94
Remove "curiosity_enc_size" from the regular params
Jun 10, 2019
31dabfc
Fix PushBlock config
Jun 10, 2019
a557f84
Revert Pyramids environment
Jun 10, 2019
d4dbddb
Fix indexing issue with add_experiences
Jun 11, 2019
ddb673b
Fix tests
Jun 11, 2019
975e05b
Change to BCModule
Jun 11, 2019
a83fd5d
Merge branch 'develop' into develop-irl-ervin
Jun 12, 2019
fae7646
Remove the bools for reward signals
Jun 12, 2019
5cf98ac
Make update take in a mini buffer rather than the
Jun 13, 2019
d1afc9b
Always reference reward signals name and not index
Jun 13, 2019
80f2c75
More code cleanup
Jun 13, 2019
394b25a
Clean up reward_signal abstract class
Jun 13, 2019
a9724a3
Fix issue with recording values
Jun 13, 2019
66fef61
Add use_actions to GAIL
Jun 17, 2019
0e3be1d
Add documentation for Reward Signals
Jun 17, 2019
015f50d
Add documentation for GAIL
Jun 17, 2019
7c3059b
Remove unused variables in BCModel
Jun 17, 2019
16c3c06
Remove Entropy Reward Signal
Jun 17, 2019
1fbfa5d
Change tests to use safe_load
Jun 17, 2019
f9a3808
Don't use mutable default
Jun 17, 2019
ce551bf
Set defaults in parent __init__ (Reward Signals)
Jun 17, 2019
3e7ea5b
Remove unneccesary lines
Jun 17, 2019
eda6993
Merge branch 'develop' into develop-irl-ervin
Jul 3, 2019
cace2e6
Make some files same as develop
Jul 3, 2019
3f161fc
Add demos for example envs
Jul 4, 2019
2794c75
Update docs
Jul 4, 2019
48b7b43
Fix tests, imports, cleanup code
Jul 8, 2019
f47b173
Make pretrainer stats similar to reward signal
Jul 9, 2019
1e257d4
Merge branch 'develop' of github.com:Unity-Technologies/ml-agents int…
Jul 9, 2019
a8b5d09
Fixes after merge develop
Jul 10, 2019
fb3d5ae
Additional tests, bugfix for LSTM+BC+Visual
Jul 10, 2019
7e0a677
GAIL code cleanup
Jul 10, 2019
1953233
Add types to BCModel
Jul 10, 2019
593f819
Fix bugs with incorrect return values
Jul 11, 2019
98b7732
Change tests to use RewardSignalResult
Jul 11, 2019
6ee0c63
Add docs for pretraining and plot for all three
Jul 11, 2019
6d37be2
Fix bug with demo loading directories, add test
Jul 11, 2019
c672ad9
Add typing to BCModule, GAIL, and demo loader
Jul 11, 2019
61e84c6
Fix black
Jul 11, 2019
9d43336
Fix mypy issues
Jul 11, 2019
99a2a3c
Codacy cleanup
Jul 12, 2019
cbb1af3
Doc fixes
Jul 12, 2019
736c807
More sophisticated tests for reward signals
Jul 13, 2019
04e22fd
Fix bug in GAIL when num_sequences is 1
Jul 13, 2019
8ead02e
Clean up use_vail and feed_dicts
Jul 15, 2019
71f85e1
Change to swish from learningmodel
Jul 15, 2019
5537e60
Make variables more readable
Jul 15, 2019
73d20cb
Code and comment cleanup
Jul 15, 2019
f4950b4
Not all should be swish
Jul 15, 2019
6784ee6
Remove prints
Jul 15, 2019
2704e62
Doc updates
Jul 15, 2019
1206a89
Make VAIL default false, improve logging
Jul 15, 2019
2407a5a
Fix tests for sequences
Jul 16, 2019
4aa033b
Change max_batches and set VAIL to default to false
Jul 16, 2019
Binary files added:

* demos/Expert3DBall.demo
* demos/Expert3DBallHard.demo
* demos/ExpertBanana.demo
* demos/ExpertBasic.demo
* demos/ExpertBouncer.demo
* demos/ExpertCrawlerDyn.demo
* demos/ExpertCrawlerSta.demo
* demos/ExpertGrid.demo
* demos/ExpertHallway.demo
* demos/ExpertPush.demo
* demos/ExpertPyramid.demo
* demos/ExpertReacher.demo
* demos/ExpertSoccerGoal.demo
* demos/ExpertSoccerStri.demo
* demos/ExpertTennis.demo
* demos/ExpertWalker.demo
92 changes: 92 additions & 0 deletions docs/Training-BehavioralCloning.md
@@ -0,0 +1,92 @@
# Training with Behavioral Cloning

There are a variety of possible imitation learning algorithms that can be
used; the simplest of them is Behavioral Cloning. It works by collecting
demonstrations from a teacher and then using them to directly learn a
policy, in the same way supervised learning for image classification
or other traditional Machine Learning tasks works.

## Offline Training

With offline behavioral cloning, we can use demonstrations (`.demo` files)
generated using the `Demonstration Recorder` as the dataset used to train a behavior.

1. Choose an agent you would like to train to imitate a set of demonstrations.
2. Record a set of demonstrations using the `Demonstration Recorder` (see [here](Training-Imitation-Learning.md)).
For illustrative purposes we will refer to this file as `AgentRecording.demo`.
3. Build the scene, assigning the agent a Learning Brain, and set the Brain to
Control in the Broadcast Hub. For more information on Brains, see
[here](Learning-Environment-Design-Brains.md).
4. Open the `config/offline_bc_config.yaml` file.
5. Modify the `demo_path` parameter in the file to reference the path to the
demonstration file recorded in step 2. In our case this is:
`./UnitySDK/Assets/Demonstrations/AgentRecording.demo`
6. Launch `mlagents-learn`, providing `./config/offline_bc_config.yaml`
as the config parameter, and include the `--run-id` and `--train` flags as usual.
Provide your environment as the `--env` parameter if it has been compiled
as a standalone build, or omit it to train in the Editor.
7. (Optional) Observe training performance using TensorBoard.

This will use the demonstration file to train a neural network driven agent
to directly imitate the actions provided in the demonstration. The environment
will launch and be used for evaluating the agent's performance during training.
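
As a rough illustration, the entry you edit in steps 4 and 5 might look like the
sketch below. The brain name `StudentBrain` and the hyperparameter values are
placeholders, and the exact keys in your copy of `config/offline_bc_config.yaml`
may differ; only `demo_path` needs to point at your recording.

```
# Hypothetical excerpt from config/offline_bc_config.yaml (illustrative only)
StudentBrain:
    trainer: offline_bc        # offline behavioral cloning trainer
    max_steps: 5.0e4           # total training steps
    batch_size: 64             # demonstration experiences per gradient update
    batches_per_epoch: 10      # batches trained per epoch
    demo_path: ./UnitySDK/Assets/Demonstrations/AgentRecording.demo
```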

## Online Training

It is also possible to provide demonstrations in realtime during training,
without pre-recording a demonstration file. The steps to do this are as follows:

1. First create two Brains, one which will be the "Teacher," and the other which
will be the "Student." We will assume that the names of the Brain
Assets are "Teacher" and "Student" respectively.
2. The "Teacher" Brain must be a **Player Brain**. You must properly
configure the inputs to map to the corresponding actions.
3. The "Student" Brain must be a **Learning Brain**.
4. The Brain Parameters of both the "Teacher" and "Student" Brains must be
compatible with the agent.
5. Drag both the "Teacher" and "Student" Brain into the Academy's `Broadcast Hub`
and check the `Control` checkbox on the "Student" Brain.
6. Link the Brains to the desired Agents (one Agent as the teacher and at least
one Agent as a student).
7. In `config/online_bc_config.yaml`, add an entry for the "Student" Brain. Set
the `trainer` parameter of this entry to `online_bc`, and the
`brain_to_imitate` parameter to the name of the teacher Brain: "Teacher".
Additionally, set `batches_per_epoch`, which controls how many batches are
trained each epoch. Increase the `max_steps` option if you'd like to keep training
the Agents for a longer period of time. (A sketch of such an entry is shown after this list.)
8. Launch the training process with `mlagents-learn config/online_bc_config.yaml
--train --slow`, and press the :arrow_forward: button in Unity when the
message _"Start training by pressing the Play button in the Unity Editor"_ is
displayed on the screen
9. From the Unity window, control the Agent with the Teacher Brain by providing
"teacher demonstrations" of the behavior you would like to see.
10. Watch as the Agent(s) with the student Brain attached begin to behave
similarly to the demonstrations.
11. Once the Student Agents are exhibiting the desired behavior, end the training
process with `CTRL+C` from the command line.
12. Move the resulting `*.nn` file into the `TFModels` subdirectory of the
Assets folder (or a subdirectory within Assets of your choosing), and use it
with a `Learning` Brain.
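
As referenced in step 7, a hypothetical entry for the "Student" Brain in
`config/online_bc_config.yaml` might look like the following sketch; the values
are placeholders and the surrounding defaults in your copy of the file may differ.

```
# Hypothetical entry for the "Student" Brain (illustrative only)
Student:
    trainer: online_bc         # imitate the teacher in realtime
    brain_to_imitate: Teacher  # name of the teacher (Player) Brain
    batches_per_epoch: 5       # batches trained each epoch
    max_steps: 1.0e5           # increase to keep training longer
```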

**BC Teacher Helper**

We provide a convenience utility, the `BC Teacher Helper` component, that you can add
to the Teacher Agent.

<p align="center">
<img src="images/bc_teacher_helper.png"
alt="BC Teacher Helper"
width="375" border="10" />
</p>

This utility enables you to use keyboard shortcuts to do the following:

1. Start and stop recording experiences. This is useful in case you'd like to
interact with the game _but not have the agents learn from these
interactions_. The default command to toggle recording is to press `R` on the
keyboard.

2. Reset the training buffer. This enables you to instruct the agents to forget
their buffer of recent experiences. This is useful if you'd like to get them
to quickly learn a new behavior. The default command to reset the buffer is
to press `C` on the keyboard.
125 changes: 30 additions & 95 deletions docs/Training-Imitation-Learning.md
@@ -10,6 +10,35 @@ from the game and actions from a game controller to guide the medic's behavior.
Imitation Learning uses pairs of observations and actions from
a demonstration to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFYs).

Imitation learning can also be used to help reinforcement learning. Especially in
environments with sparse (i.e., infrequent or rare) rewards, the agent may never see
the reward and thus not learn from it. Curiosity helps the agent explore, but in some cases
it is easier to just show the agent how to achieve the reward. In these cases,
imitation learning can dramatically reduce the time it takes to solve the environment.
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids),
just 6 episodes of demonstrations can reduce the number of training steps required by more than a factor of four.

<p align="center">
<img src="images/mlagents-ImitationAndRL.png"
alt="Using Demonstrations with Reinforcement Learning"
width="350" border="0" />
</p>

ML-Agents provides several ways to interact with demonstrations. For most situations,
[GAIL](Training-RewardSignals.md#the-gail-reward-signal) is the preferred approach.

* To train using GAIL (Generative Adversarial Imitation Learning), you can add the
[GAIL reward signal](Training-RewardSignals.md#the-gail-reward-signal) to your
trainer configuration (see the sketch after this list). GAIL can be
used with or without environment rewards, and works well when there are a limited
number of demonstrations.
* To help bootstrap reinforcement learning, you can enable
[pretraining](Training-PPO.md#optional-pretraining-using-demonstrations)
on the PPO trainer, in addition to using a small GAIL reward signal.
* To train an agent to exactly mimic demonstrations, you can use the
[Behavioral Cloning](Training-BehavioralCloning.md) trainer. Behavioral Cloning can be
used offline and online (in-editor), and learns very quickly. However, it is usually
ineffective in more complex environments without a large number of demonstrations.
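
For reference, a `reward_signals` block that adds GAIL alongside the extrinsic
reward might look like the sketch below. The numeric values are illustrative,
and the option names (in particular `encoding_size`) should be checked against
[Training-RewardSignals.md](Training-RewardSignals.md#the-gail-reward-signal),
which is the authoritative list.

```
# Illustrative sketch of a trainer_config reward_signals block with GAIL
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    gail:
        strength: 0.01     # keep small when combined with an extrinsic reward
        gamma: 0.99
        encoding_size: 128
        demo_path: demos/ExpertPyramid.demo
```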

## Recording Demonstrations

It is possible to record demonstrations of agent behavior from the Unity Editor,
Expand Down Expand Up @@ -43,98 +72,4 @@ inspector.
alt="BC Teacher Helper"
width="375" border="10" />
</p>


_(Removed: the former "Training with Behavioral Cloning" section of this file, which was moved, with minor link adjustments, into the new `docs/Training-BehavioralCloning.md` shown above.)_
70 changes: 68 additions & 2 deletions docs/Training-PPO.md
@@ -22,8 +22,7 @@ If you are using curriculum training to pace the difficulty of the learning task
presented to an agent, see [Training with Curriculum
Learning](Training-Curriculum-Learning.md).

For information about imitation learning from demonstrations, see
[Training with Imitation Learning](Training-Imitation-Learning.md).

## Best Practices when training with PPO
@@ -191,6 +190,73 @@ the agent will need to remember in order to successfully complete the task.

Typical Range: `64` - `512`

## (Optional) Pretraining Using Demonstrations

In some cases, we may want to bootstrap the agent's policy using behavior recorded
from a player. This can help guide the agent towards the reward. Pretraining adds
training operations that mimic a demonstration rather than attempting to maximize reward.
It is essentially equivalent to running [behavioral cloning](./Training-BehavioralCloning.md)
in-line with PPO.

To use pretraining, add a `pretraining` section to the trainer_config. For instance:

```
pretraining:
demo_path: ./demos/ExpertPyramid.demo
strength: 0.5
steps: 10000
```

Below are the available hyperparameters for pretraining.

### Strength

`strength` corresponds to the learning rate of the imitation relative to the learning
rate of PPO, and roughly corresponds to how strongly we allow the behavioral cloning
to influence the policy.

Typical Range: `0.1` - `0.5`

### Demo Path

`demo_path` is the path to your `.demo` file or directory of `.demo` files.
See the [imitation learning guide](Training-Imitation-Learning.md) for more on `.demo` files.

### Steps

During pretraining, it is often desirable to stop using demonstrations after the agent has
"seen" rewards, and allow it to optimize past the available demonstrations and/or generalize
outside of the provided demonstrations. `steps` corresponds to the training steps over which
pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set
the steps to 0 for constant imitation over the entire training run.

### (Optional) Batch Size

`batch_size` is the number of demonstration experiences used for one iteration of a gradient
descent update. If not specified, it will default to the `batch_size` defined for PPO.

Typical Range (Continuous): `512` - `5120`

Typical Range (Discrete): `32` - `512`

### (Optional) Number of Epochs

`num_epoch` is the number of passes through the experience buffer during
gradient descent. If not specified, it will default to the number of epochs set for PPO.

Typical Range: `3` - `10`

### (Optional) Max Batches
> **Review discussion**
>
> **Contributor:** This one might be a little confusing to people.
>
> **Author:** I agree, but I also think it's necessary for people who have a huge demonstration dataset. We could do a couple of things:
> * Remove the option and just set to the buffer size given by PPO - perhaps allow overriding
> * Change the option to Samples Per Update or Demonstration Buffer Size
> * Leave as-is
>
> **Contributor:** I agree that it is useful to have. I think it just needs a different name that is a little more descriptive. "Samples Per Update" could fit the bill.
>
> **Author:** Done! (and yes any issues were related to the stochasticity)
`max_batches` is the maximum number of batches of `batch_size`
to use during each imitation update. You may want to lower this if your demonstration
dataset is very large, to avoid overfitting the policy on demonstrations. Set to 0
to train over all of the demonstrations at each update step.

Default Value: `0` (all)

Typical Range: `10` - `20`
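
Putting the optional hyperparameters together, a `pretraining` section that
overrides all of them might look like the sketch below; the numeric values are
illustrative rather than recommended defaults.

```
# Sketch of a pretraining section with the optional hyperparameters set
pretraining:
    demo_path: ./demos/ExpertPyramid.demo
    strength: 0.5       # imitation learning rate relative to PPO
    steps: 10000        # anneal the pretrainer over this many steps
    batch_size: 512     # optional, defaults to the PPO batch_size
    num_epoch: 3        # optional, defaults to the PPO num_epoch
    max_batches: 10     # optional, 0 trains over all demonstrations each update
```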

## Training Statistics

To view training statistics, use TensorBoard. For information on launching and