From e8b91566dcf206480de3dbd808515ea45ce0b015 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 20:56:31 -0700 Subject: [PATCH 01/49] Included explicit version # for ZN --- docs/localized/zh-CN/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/localized/zh-CN/README.md b/docs/localized/zh-CN/README.md index 0d38124091..20ef6a4cdd 100755 --- a/docs/localized/zh-CN/README.md +++ b/docs/localized/zh-CN/README.md @@ -1,6 +1,6 @@ -# Unity ML-Agents 工具包(Beta) +# Unity ML-Agents 工具包(Beta) v0.3.1 **注意:** 本文档为v0.3版本文档的部分翻译版,目前并不会随着英文版文档更新而更新。若要查看更新更全的英文版文档,请查看[这里](https://github.com/Unity-Technologies/ml-agents)。 From 0602ba4957cf81c95196b9ac935dc470bd454879 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 20:57:09 -0700 Subject: [PATCH 02/49] added explicit version for KR docs --- docs/localized/KR/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/localized/KR/README.md b/docs/localized/KR/README.md index 03d6b1a92e..ac49277f6e 100644 --- a/docs/localized/KR/README.md +++ b/docs/localized/KR/README.md @@ -1,6 +1,6 @@  -# Unity ML-Agents Toolkit (Beta) +# Unity ML-Agents Toolkit (Beta) v0.9 [![docs badge](https://img.shields.io/badge/docs-reference-blue.svg)](docs/Readme.md) [![license badge](https://img.shields.io/badge/license-Apache--2.0-green.svg)](LICENSE) From 8a8a90862d90630e3aa677b95b1d552409d2eef0 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:00:22 -0700 Subject: [PATCH 03/49] minor fix in installation doc --- docs/Installation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Installation.md b/docs/Installation.md index 198d201748..bb87d88b17 100644 --- a/docs/Installation.md +++ b/docs/Installation.md @@ -63,7 +63,7 @@ If you installed this correctly, you should be able to run `mlagents-learn --help`, after which you will see the Unity logo and the command line parameters you can use with `mlagents-learn`. -By installing the `mlagents` package, its dependencies listed in the [setup.py file](../ml-agents/setup.py) are also installed. +By installing the `mlagents` package, the dependencies listed in the [setup.py file](../ml-agents/setup.py) are also installed. Some of the primary dependencies include: - [TensorFlow](Background-TensorFlow.md) (Requires a CPU w/ AVX support) From 849cf8dd04efc738e79e8c38f4badb91b6acde1f Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:02:46 -0700 Subject: [PATCH 04/49] Consistency with numbers for reset parameters --- docs/Learning-Environment-Examples.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Learning-Environment-Examples.md b/docs/Learning-Environment-Examples.md index dcdcbbf947..81590eb0eb 100644 --- a/docs/Learning-Environment-Examples.md +++ b/docs/Learning-Environment-Examples.md @@ -195,7 +195,7 @@ If you would like to contribute environments, please see our * Side Motion (3 possible actions: Left, Right, No Action) * Jump (2 possible actions: Jump, No Action) * Visual Observations: None. -* Reset Parameters: 4, corresponding to the height of the possible walls. +* Reset Parameters: Four, corresponding to the height of the possible walls. 
* Benchmark Mean Reward (Big & Small Wall Brain): 0.8 ## [Reacher](https://youtu.be/2N9EoF6pQyE) From e344aa090c0d6733c2f9bd22e945c11ee124df6f Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:08:48 -0700 Subject: [PATCH 05/49] Removed extra verbiage. minor consistency --- docs/Learning-Environment-Examples.md | 32 +++++++++++++-------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/Learning-Environment-Examples.md b/docs/Learning-Environment-Examples.md index 81590eb0eb..3d695063b7 100644 --- a/docs/Learning-Environment-Examples.md +++ b/docs/Learning-Environment-Examples.md @@ -32,7 +32,7 @@ If you would like to contribute environments, please see our * Vector Observation space: One variable corresponding to current state. * Vector Action space: (Discrete) Two possible actions (Move left, move right). - * Visual Observations: None. + * Visual Observations: None * Reset Parameters: None * Benchmark Mean Reward: 0.94 @@ -56,7 +56,7 @@ If you would like to contribute environments, please see our * Vector Action space: (Continuous) Size of 2, with one value corresponding to X-rotation, and the other to Z-rotation. * Visual Observations: None. -* Reset Parameters: Three, corresponding to the following: +* Reset Parameters: Three * scale: Specifies the scale of the ball in the 3 dimensions (equal across the three dimensions) * Default: 1 * Recommended Minimum: 0.2 @@ -117,7 +117,7 @@ If you would like to contribute environments, please see our * Vector Action space: (Continuous) Size of 2, corresponding to movement toward net or away from net, and jumping. * Visual Observations: None. -* Reset Parameters: Three, corresponding to the following: +* Reset Parameters: Three * angle: Angle of the racket from the vertical (Y) axis. * Default: 55 * Recommended Minimum: 35 @@ -153,7 +153,7 @@ If you would like to contribute environments, please see our `VisualPushBlock` scene. __The visual observation version of this environment does not train with the provided default training parameters.__ -* Reset Parameters: Four, corresponding to the following: +* Reset Parameters: Four * block_scale: Scale of the block along the x and z dimensions * Default: 2 * Recommended Minimum: 0.5 @@ -195,7 +195,7 @@ If you would like to contribute environments, please see our * Side Motion (3 possible actions: Left, Right, No Action) * Jump (2 possible actions: Jump, No Action) * Visual Observations: None. -* Reset Parameters: Four, corresponding to the height of the possible walls. +* Reset Parameters: Four. * Benchmark Mean Reward (Big & Small Wall Brain): 0.8 ## [Reacher](https://youtu.be/2N9EoF6pQyE) @@ -213,7 +213,7 @@ If you would like to contribute environments, please see our * Vector Action space: (Continuous) Size of 4, corresponding to torque applicable to two joints. * Visual Observations: None. -* Reset Parameters: Five, corresponding to the following +* Reset Parameters: Five * goal_size: radius of the goal zone * Default: 5 * Recommended Minimum: 1 @@ -254,7 +254,7 @@ If you would like to contribute environments, please see our angular acceleration of the body. * Vector Action space: (Continuous) Size of 20, corresponding to target rotations for joints. - * Visual Observations: None. 
+ * Visual Observations: None * Reset Parameters: None * Benchmark Mean Reward for `CrawlerStaticTarget`: 2000 * Benchmark Mean Reward for `CrawlerDynamicTarget`: 400 @@ -284,7 +284,7 @@ If you would like to contribute environments, please see our `VisualBanana` scene. __The visual observation version of this environment does not train with the provided default training parameters.__ -* Reset Parameters: Two, corresponding to the following +* Reset Parameters: Two * laser_length: Length of the laser used by the agent * Default: 1 * Recommended Minimum: 0.2 @@ -318,7 +318,7 @@ If you would like to contribute environments, please see our `VisualHallway` scene. __The visual observation version of this environment does not train with the provided default training parameters.__ -* Reset Parameters: None. +* Reset Parameters: None * Benchmark Mean Reward: 0.7 * To speed up training, you can enable curiosity by adding `use_curiosity: true` in `config/trainer_config.yaml` * Optional Imitation Learning scene: `HallwayIL`. @@ -340,8 +340,8 @@ If you would like to contribute environments, please see our banana. * Vector Action space: (Continuous) 3 corresponding to agent force applied for the jump. - * Visual Observations: None. -* Reset Parameters: Two, corresponding to the following + * Visual Observations: None +* Reset Parameters: Two * banana_scale: The scale of the banana in the 3 dimensions * Default: 150 * Recommended Minimum: 50 @@ -375,8 +375,8 @@ If you would like to contribute environments, please see our * Striker: 6 actions corresponding to forward, backward, sideways movement, as well as rotation. * Goalie: 4 actions corresponding to forward, backward, sideways movement. - * Visual Observations: None. -* Reset Parameters: Two, corresponding to the following: + * Visual Observations: None +* Reset Parameters: Two * ball_scale: Specifies the scale of the ball in the 3 dimensions (equal across the three dimensions) * Default: 7.5 * Recommended minimum: 4 @@ -409,8 +409,8 @@ If you would like to contribute environments, please see our velocity, and angular velocities of each limb, along with goal direction. * Vector Action space: (Continuous) Size of 39, corresponding to target rotations applicable to the joints. - * Visual Observations: None. -* Reset Parameters: Four, corresponding to the following + * Visual Observations: None +* Reset Parameters: Four * gravity: Magnitude of gravity * Default: 9.81 * Recommended Minimum: @@ -450,6 +450,6 @@ If you would like to contribute environments, please see our `VisualPyramids` scene. __The visual observation version of this environment does not train with the provided default training parameters.__ -* Reset Parameters: None. +* Reset Parameters: None * Optional Imitation Learning scene: `PyramidsIL`. * Benchmark Mean Reward: 1.75 From 5c48bd65dc35b178187be0171029fdd2f5aac6cf Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:09:15 -0700 Subject: [PATCH 06/49] minor consistency --- docs/Learning-Environment-Examples.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Learning-Environment-Examples.md b/docs/Learning-Environment-Examples.md index 3d695063b7..f30f9d2aa8 100644 --- a/docs/Learning-Environment-Examples.md +++ b/docs/Learning-Environment-Examples.md @@ -116,7 +116,7 @@ If you would like to contribute environments, please see our of ball and racket. 
* Vector Action space: (Continuous) Size of 2, corresponding to movement toward net or away from net, and jumping. - * Visual Observations: None. + * Visual Observations: None * Reset Parameters: Three * angle: Angle of the racket from the vertical (Y) axis. * Default: 55 From 4925e08158696f0dc01d2d429a8e3a0bddd2dae1 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:16:31 -0700 Subject: [PATCH 07/49] Cleaned up IL language --- docs/ML-Agents-Overview.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/ML-Agents-Overview.md b/docs/ML-Agents-Overview.md index daeb770745..3ca0af37b4 100644 --- a/docs/ML-Agents-Overview.md +++ b/docs/ML-Agents-Overview.md @@ -319,11 +319,11 @@ imitation learning algorithm will then use these pairs of observations and actions from the human player to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFYs). -ML-Agents provides ways to both learn directly from demonstrations as well as -use demonstrations to help speed up reward-based training, and two algorithms to do -so (Generative Adversarial Imitation Learning and Behavioral Cloning). The -[Training with Imitation Learning](Training-Imitation-Learning.md) tutorial -covers these features in more depth. +The toolkit provides a way for agents to learn directly from demonstrations using +the Behavioral Cloning algorithm. The toolkit also enables use of demonstrations +to help speed up reward-based (RL) training using the Generative Adversarial +Imitation Learning (GAIL) algorithm. The [Training with Imitation Learning](Training-Imitation-Learning.md) +tutorial covers these features in more depth. ## Flexible Training Scenarios From b902ed81d4a0fda9e5784b995e839332552c43aa Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:20:16 -0700 Subject: [PATCH 08/49] moved parameter sampling above in list --- docs/ML-Agents-Overview.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/ML-Agents-Overview.md b/docs/ML-Agents-Overview.md index 3ca0af37b4..3355a7ece4 100644 --- a/docs/ML-Agents-Overview.md +++ b/docs/ML-Agents-Overview.md @@ -408,6 +408,14 @@ training process. learn more about adding visual observations to an agent [here](Learning-Environment-Design-Agents.md#multiple-visual-observations). +- **Training with Environment Parameter Sampling** - To train agents to be robust + to changes in its environment (i.e., generalization), the agent should be exposed + to a variety of environment variations. Similarly to Curriculum Learning, which + allows environments to get more difficult as the agent learns, we also provide + a way to randomly resample aspects of the environment during training. See + [Training with Environment Parameter Sampling](Training-Generalization-Learning.md) + to learn more about this feature. + - **Broadcasting** - As discussed earlier, a Learning Brain sends the observations for all its Agents to the Python API when dragged into the Academy's `Broadcast Hub` with the `Control` checkbox checked. This is helpful @@ -422,14 +430,6 @@ training process. the broadcasting feature [here](Learning-Environment-Design-Brains.md#using-the-broadcast-feature). -- **Training with Environment Parameter Sampling** - To train agents to be robust - to changes in its environment (i.e., generalization), the agent should be exposed - to a variety of environment variations. 
Similarly to Curriculum Learning, which - allows environments to get more difficult as the agent learns, we also provide - a way to randomly resample aspects of the environment during training. See - [Training with Environment Parameter Sampling](Training-Generalization-Learning.md) - to learn more about this feature. - - **Docker Set-up (Experimental)** - To facilitate setting up ML-Agents without installing Python or TensorFlow directly, we provide a [guide](Using-Docker.md) on how to create and run a Docker container. From 3e93bd8f4b1e46bee63bb42aed3e32ea269686aa Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:24:27 -0700 Subject: [PATCH 09/49] Cleaned up language in Env Parameter sampling --- docs/ML-Agents-Overview.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/ML-Agents-Overview.md b/docs/ML-Agents-Overview.md index 3355a7ece4..b7b0d49aad 100644 --- a/docs/ML-Agents-Overview.md +++ b/docs/ML-Agents-Overview.md @@ -408,11 +408,11 @@ training process. learn more about adding visual observations to an agent [here](Learning-Environment-Design-Agents.md#multiple-visual-observations). -- **Training with Environment Parameter Sampling** - To train agents to be robust +- **Training with Environment Parameter Sampling** - To train agents to be adapt to changes in its environment (i.e., generalization), the agent should be exposed - to a variety of environment variations. Similarly to Curriculum Learning, which - allows environments to get more difficult as the agent learns, we also provide - a way to randomly resample aspects of the environment during training. See + to a variations of the environment. Similar to Curriculum Learning, + where environments become more difficult as the agent learns, the toolkit provides + a way to randomly sample aspects of the environment during training. See [Training with Environment Parameter Sampling](Training-Generalization-Learning.md) to learn more about this feature. From 5a8bc9b407678caab045cb20d4b514c4707e4da2 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:29:42 -0700 Subject: [PATCH 10/49] Cleaned up migrating content --- docs/Migrating.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/Migrating.md b/docs/Migrating.md index 4b00fcc5e2..244e803afc 100644 --- a/docs/Migrating.md +++ b/docs/Migrating.md @@ -5,15 +5,15 @@ ### Important Changes * We have changed the way reward signals (including Curiosity) are defined in the `trainer_config.yaml`. -* When using multiple environments, every "step" as recorded in TensorBoard and -printed in the command line now corresponds to a single step of a single environment. +* When using multiple environments, every "step" is recorded in TensorBoard. +* The steps in the command line console corresponds to a single step of a single environment. Previously, each step corresponded to one step for all environments (i.e., `num_envs` steps). #### Steps to Migrate * If you were overriding any of these following parameters in your config file, remove them from the top-level config and follow the steps below: - * `gamma` - Define a new `extrinsic` reward signal and set it's `gamma` to your new gamma. - * `use_curiosity`, `curiosity_strength`, `curiosity_enc_size` - Define a `curiosity` reward signal + * `gamma`: Define a new `extrinsic` reward signal and set it's `gamma` to your new gamma. 
+ * `use_curiosity`, `curiosity_strength`, `curiosity_enc_size`: Define a `curiosity` reward signal and set its `strength` to `curiosity_strength`, and `encoding_size` to `curiosity_enc_size`. Give it the same `gamma` as your `extrinsic` signal to mimic previous behavior. See [Reward Signals](Training-RewardSignals.md) for more information on defining reward signals. From 6451d44e94283dbe0d586d71b163413bdea47a87 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:31:17 -0700 Subject: [PATCH 11/49] updated consistency of Reset Parameter Sampling --- docs/ML-Agents-Overview.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/ML-Agents-Overview.md b/docs/ML-Agents-Overview.md index b7b0d49aad..8f44928bba 100644 --- a/docs/ML-Agents-Overview.md +++ b/docs/ML-Agents-Overview.md @@ -408,11 +408,11 @@ training process. learn more about adding visual observations to an agent [here](Learning-Environment-Design-Agents.md#multiple-visual-observations). -- **Training with Environment Parameter Sampling** - To train agents to be adapt +- **Training with Reset Parameter Sampling** - To train agents to be adapt to changes in its environment (i.e., generalization), the agent should be exposed - to a variations of the environment. Similar to Curriculum Learning, + to several variations of the environment. Similar to Curriculum Learning, where environments become more difficult as the agent learns, the toolkit provides - a way to randomly sample aspects of the environment during training. See + a way to randomly sample Reset Parameters of the environment during training. See [Training with Environment Parameter Sampling](Training-Generalization-Learning.md) to learn more about this feature. From 6a8dc2388eb8aaf2b27271838a4e0d3d0c4829c0 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:34:51 -0700 Subject: [PATCH 12/49] Rename Training-Generalization-Learning.md to Training-Generalization-Reinforcement-Learning-Agents.md --- ...d => Training-Generalization-Reinforcement-Learning-Agents.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/{Training-Generalization-Learning.md => Training-Generalization-Reinforcement-Learning-Agents.md} (100%) diff --git a/docs/Training-Generalization-Learning.md b/docs/Training-Generalization-Reinforcement-Learning-Agents.md similarity index 100% rename from docs/Training-Generalization-Learning.md rename to docs/Training-Generalization-Reinforcement-Learning-Agents.md From 9b54d7c7a6acbc87eaf722ef0268ae12f422c2ac Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:35:45 -0700 Subject: [PATCH 13/49] Updated doc link for generalization --- docs/ML-Agents-Overview.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/ML-Agents-Overview.md b/docs/ML-Agents-Overview.md index 8f44928bba..2ac2b14c37 100644 --- a/docs/ML-Agents-Overview.md +++ b/docs/ML-Agents-Overview.md @@ -413,7 +413,7 @@ training process. to several variations of the environment. Similar to Curriculum Learning, where environments become more difficult as the agent learns, the toolkit provides a way to randomly sample Reset Parameters of the environment during training. 
See - [Training with Environment Parameter Sampling](Training-Generalization-Learning.md) + [Training Generalized Reinforcement Learning Agents](Training-Generalized-Reinforcement-Learning-Agents.md) to learn more about this feature. - **Broadcasting** - As discussed earlier, a Learning Brain sends the From 036334951f37d3f4498e01499527761e7afc05c5 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:35:55 -0700 Subject: [PATCH 14/49] Rename Training-Generalization-Reinforcement-Learning-Agents.md to Training-Generalized-Reinforcement-Learning-Agents.md --- ...s.md => Training-Generalized-Reinforcement-Learning-Agents.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/{Training-Generalization-Reinforcement-Learning-Agents.md => Training-Generalized-Reinforcement-Learning-Agents.md} (100%) diff --git a/docs/Training-Generalization-Reinforcement-Learning-Agents.md b/docs/Training-Generalized-Reinforcement-Learning-Agents.md similarity index 100% rename from docs/Training-Generalization-Reinforcement-Learning-Agents.md rename to docs/Training-Generalized-Reinforcement-Learning-Agents.md From b70148ef09c416cc042e0968a21fdfa8f6613d91 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:43:42 -0700 Subject: [PATCH 15/49] Re-wrote the intro paragraph for generalization --- ...neralized-Reinforcement-Learning-Agents.md | 21 ++++++++++--------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/docs/Training-Generalized-Reinforcement-Learning-Agents.md b/docs/Training-Generalized-Reinforcement-Learning-Agents.md index 79dea8da9e..b784109763 100644 --- a/docs/Training-Generalized-Reinforcement-Learning-Agents.md +++ b/docs/Training-Generalized-Reinforcement-Learning-Agents.md @@ -1,15 +1,16 @@ # Training Generalized Reinforcement Learning Agents -Reinforcement learning has a rather unique setup as opposed to supervised and -unsupervised learning. Agents here are trained and tested on the same exact -environment, which is analogous to a model being trained and tested on an -identical dataset in supervised learning! This setting results in overfitting; -the inability of the agent to generalize to slight tweaks or variations in the -environment. This is problematic in instances when environments are randomly -instantiated with varying properties. To make agents robust, one approach is to -train an agent over multiple variations of the environment. The agent is -trained in this approach with the intent that it learns to adapt its performance -to future unseen variations of the environment. +One of the challenges of training and testing agents on the same +environment is that the agents tend to overfit. The result is that the +agents are unable to generalize to any tweaks or variations in the enviornment. +This is analgous to a model being trained and tested on an identical dataset +in supervised learning. This becomes problematic in cases where environments +are randomly instantiated with varying objects or properties. + +To make agents robust and generalizable to different environments, the agent +should be trained over multiple variations of the enviornment. 
Using this approach +for training, the agent will be better suited to adapt (with higher performance) +to future unseen variations of the enviornment Ball scale of 0.5 | Ball scale of 4 :-------------------------:|:-------------------------: From 97e3991d071141ca04b3ff9b7743cee31d513335 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:50:21 -0700 Subject: [PATCH 16/49] add titles, cleaned up language for reset params --- ...ning-Generalized-Reinforcement-Learning-Agents.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/Training-Generalized-Reinforcement-Learning-Agents.md b/docs/Training-Generalized-Reinforcement-Learning-Agents.md index b784109763..9cd7ce3389 100644 --- a/docs/Training-Generalized-Reinforcement-Learning-Agents.md +++ b/docs/Training-Generalized-Reinforcement-Learning-Agents.md @@ -12,16 +12,18 @@ should be trained over multiple variations of the enviornment. Using this approa for training, the agent will be better suited to adapt (with higher performance) to future unseen variations of the enviornment +_Variations of the 3D Ball environment._ + Ball scale of 0.5 | Ball scale of 4 :-------------------------:|:-------------------------: ![](images/3dball_small.png) | ![](images/3dball_big.png) -_Variations of the 3D Ball environment._ +## Generalization Using Reset Parameters -To vary environments, we first decide what parameters to vary in an -environment. We call these parameters `Reset Parameters`. In the 3D ball -environment example displayed in the figure above, the reset parameters are -`gravity`, `ball_mass` and `ball_scale`. +To enable variations in the environments, we implemented `Reset Parameters`. We +also specify a range of default values for each `Reset Parameter` and sample +these parameters during training. In the 3D ball environment example displayed +in the figure above, the reset parameters are `gravity`, `ball_mass` and `ball_scale`. ## How-to From b31dd283bd0084b61484f4526609b46169f20193 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 21:51:21 -0700 Subject: [PATCH 17/49] Update Training-Generalized-Reinforcement-Learning-Agents.md --- docs/Training-Generalized-Reinforcement-Learning-Agents.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/Training-Generalized-Reinforcement-Learning-Agents.md b/docs/Training-Generalized-Reinforcement-Learning-Agents.md index 9cd7ce3389..5ef2089dd3 100644 --- a/docs/Training-Generalized-Reinforcement-Learning-Agents.md +++ b/docs/Training-Generalized-Reinforcement-Learning-Agents.md @@ -18,7 +18,7 @@ Ball scale of 0.5 | Ball scale of 4 :-------------------------:|:-------------------------: ![](images/3dball_small.png) | ![](images/3dball_big.png) -## Generalization Using Reset Parameters +## Introducing Generalization Using Reset Parameters To enable variations in the environments, we implemented `Reset Parameters`. We also specify a range of default values for each `Reset Parameter` and sample @@ -26,7 +26,7 @@ these parameters during training. In the 3D ball environment example displayed in the figure above, the reset parameters are `gravity`, `ball_mass` and `ball_scale`. -## How-to +## How to Enable Generalization Using Reset Parameters For generalization training, we need to provide a way to modify the environment by supplying a set of reset parameters, and vary them over time. 
This provision From da2dace98daa5c4607fe8107b3cd9a77dfc01e4a Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 22:04:23 -0700 Subject: [PATCH 18/49] cleanup of generalization doc --- ...neralized-Reinforcement-Learning-Agents.md | 44 +++++++++---------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/docs/Training-Generalized-Reinforcement-Learning-Agents.md b/docs/Training-Generalized-Reinforcement-Learning-Agents.md index 5ef2089dd3..9bd4ed0865 100644 --- a/docs/Training-Generalized-Reinforcement-Learning-Agents.md +++ b/docs/Training-Generalized-Reinforcement-Learning-Agents.md @@ -21,27 +21,26 @@ Ball scale of 0.5 | Ball scale of 4 ## Introducing Generalization Using Reset Parameters To enable variations in the environments, we implemented `Reset Parameters`. We -also specify a range of default values for each `Reset Parameter` and sample +also specify a range of values for each `Reset Parameter` and sample these parameters during training. In the 3D ball environment example displayed in the figure above, the reset parameters are `gravity`, `ball_mass` and `ball_scale`. ## How to Enable Generalization Using Reset Parameters -For generalization training, we need to provide a way to modify the environment -by supplying a set of reset parameters, and vary them over time. This provision -can be done either deterministically or randomly. +We need to provide a way to modify the environment by supplying a set of `Reset Parameters`, +and vary them over time. This provision can be done either deterministically or randomly. This is done by assigning each reset parameter a sampler, which samples a reset parameter value (such as a uniform sampler). If a sampler isn't provided for a -reset parameter, the parameter maintains the default value throughout the -training procedure, remaining unchanged. The samplers for all the reset parameters +`Reset Parameter`, the parameter maintains the default value throughout the +training procedure, remaining unchanged. The samplers for all the `Reset Parameters` are handled by a **Sampler Manager**, which also handles the generation of new values for the reset parameters when needed. -To setup the Sampler Manager, we setup a YAML file that specifies how we wish to -generate new samples. In this file, we specify the samplers and the -`resampling-interval` (number of simulation steps after which reset parameters are +To setup the Sampler Manager, we create a YAML file that specifies how we wish to +generate new samples for each `Reset Parameters`. In this file, we specify the samplers and the +`resampling-interval` (the number of simulation steps after which reset parameters are resampled). Below is an example of a sampler file for the 3D ball environment. ```yaml @@ -63,26 +62,27 @@ scale: ``` -* `resampling-interval` (int) - Specifies the number of steps for agent to +Below is the explanation of the fields in the above example. + +* `resampling-interval` - Specifies the number of steps for the agent to train under a particular environment configuration before resetting the -environment with a new sample of reset parameters. +environment with a new sample of `Reset Parameters`. -* `parameter_name` - Name of the reset parameter. This should match the name +* `parameter_name` - Name of the `Reset Parameter`. This should match the name specified in the academy of the intended environment for which the agent is being trained. 
If a parameter specified in the file doesn't exist in the -environment, then this specification will be ignored. +environment, then this parameter will be ignored. - * `sampler-type` - Specify the sampler type to use for the reset parameter. + * `sampler-type` - Specify the sampler type to use for the `Reset Parameter`. This is a string that should exist in the `Sampler Factory` (explained below). - * `sub-arguments` - Specify the characteristic parameters for the sampler. - In the example sampler file above, this would correspond to the `intervals` - key under the `multirange_uniform` sampler for the gravity reset parameter. - The key name should match the name of the corresponding argument in the sampler definition. (Look at defining a new sampler method) - + * `sub-arguments` - Specify the sub-arguments depending on the `sampler-type`. + In the example above, this would correspond to the `intervals` + under the `sampler-type` `multirange_uniform` for the `Reset Parameter` called gravity`. + The key name should match the name of the corresponding argument in the sampler definition. (See below) -The sampler manager allocates a sampler for a reset parameter by using the *Sampler Factory*, which maintains a dictionary mapping of string keys to sampler objects. The available samplers to be used for reset parameter resampling is as available in the Sampler Factory. +The sampler manager allocates a sampler for a `Reset Parameter` by using the *Sampler Factory*, which maintains a dictionary mapping of string keys to sampler objects. The available samplers to be used for `Reset Parameter` resampling is as available in the Sampler Factory. #### Possible Sampler Types @@ -125,7 +125,7 @@ This can be done by subscribing to the *register_sampler* method of the SamplerF `SamplerFactory.register_sampler(*custom_sampler_string_key*, *custom_sampler_object*)` -Once the Sampler Factory reflects the new register, the custom sampler can be used for resampling reset parameter. For demonstration, lets say our sampler was implemented as below, and we register the `CustomSampler` class with the string `custom-sampler` in the Sampler Factory. +Once the Sampler Factory reflects the new register, the custom sampler can be used for resampling `Reset Parameter`. For demonstration, lets say our sampler was implemented as below, and we register the `CustomSampler` class with the string `custom-sampler` in the Sampler Factory. ```python class CustomSampler(Sampler): @@ -137,7 +137,7 @@ class CustomSampler(Sampler): return np.random.choice(self.possible_vals) ``` -Now we need to specify this sampler in the sampler file. Lets say we wish to use this sampler for the reset parameter *mass*; the sampler file would specify the same for mass as the following (any order of the subarguments is valid). +Now we need to specify this sampler in the sampler file. Lets say we wish to use this sampler for the `Reset Parameter` *mass*; the sampler file would specify the same for mass as the following (any order of the subarguments is valid). 
```yaml mass: From 8b44655f37fd9bb31e2372bffb55c7869184a8c5 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 22:07:23 -0700 Subject: [PATCH 19/49] More cleanup in generalization --- docs/Training-Generalized-Reinforcement-Learning-Agents.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/docs/Training-Generalized-Reinforcement-Learning-Agents.md b/docs/Training-Generalized-Reinforcement-Learning-Agents.md index 9bd4ed0865..4f73ecd4ae 100644 --- a/docs/Training-Generalized-Reinforcement-Learning-Agents.md +++ b/docs/Training-Generalized-Reinforcement-Learning-Agents.md @@ -80,9 +80,12 @@ environment, then this parameter will be ignored. * `sub-arguments` - Specify the sub-arguments depending on the `sampler-type`. In the example above, this would correspond to the `intervals` under the `sampler-type` `multirange_uniform` for the `Reset Parameter` called gravity`. - The key name should match the name of the corresponding argument in the sampler definition. (See below) + The key name should match the name of the corresponding argument in the sampler definition. + (See below) -The sampler manager allocates a sampler for a `Reset Parameter` by using the *Sampler Factory*, which maintains a dictionary mapping of string keys to sampler objects. The available samplers to be used for `Reset Parameter` resampling is as available in the Sampler Factory. +The Sampler Manager allocates a sampler for each `Reset Parameter` by using the *Sampler Factory*, +which maintains a dictionary mapping of string keys to sampler objects. The available samplers +to be used for each `Reset Parameter` is available in the Sampler Factory. #### Possible Sampler Types From d6e870a3b13a1254deb104920998a1e887259f31 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 22:07:49 -0700 Subject: [PATCH 20/49] Fixed title --- docs/Training-Generalized-Reinforcement-Learning-Agents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Training-Generalized-Reinforcement-Learning-Agents.md b/docs/Training-Generalized-Reinforcement-Learning-Agents.md index 4f73ecd4ae..ba8fc8c762 100644 --- a/docs/Training-Generalized-Reinforcement-Learning-Agents.md +++ b/docs/Training-Generalized-Reinforcement-Learning-Agents.md @@ -87,7 +87,7 @@ The Sampler Manager allocates a sampler for each `Reset Parameter` by using the which maintains a dictionary mapping of string keys to sampler objects. The available samplers to be used for each `Reset Parameter` is available in the Sampler Factory. 
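For reference, the factory lookup described above can be pictured as a small dictionary keyed by the `sampler-type` string. The sketch below is illustrative only; the class and variable names are simplified stand-ins, not the toolkit's actual implementation in `sampler_class.py`:

```python
import random


class UniformSampler:
    # Toy stand-in for a uniform Reset Parameter sampler.
    def __init__(self, min_value, max_value):
        self.min_value = min_value
        self.max_value = max_value

    def sample_all(self):
        # Draw one value for the Reset Parameter, as the real samplers do.
        return random.uniform(self.min_value, self.max_value)


# The "Sampler Factory": string keys from the YAML file -> sampler classes.
SAMPLER_FACTORY = {"uniform": UniformSampler}


def make_sampler(sampler_type, **sub_arguments):
    # Look up the sampler by its string key and construct it from the
    # sub-arguments given under that Reset Parameter in the sampler file.
    return SAMPLER_FACTORY[sampler_type](**sub_arguments)


sampler = make_sampler("uniform", min_value=7.0, max_value=12.0)
print(sampler.sample_all())
```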
-#### Possible Sampler Types +### Possible Sampler Types The currently implemented samplers that can be used with the `sampler-type` arguments are: From d7012808de2ba7414939571a7ab115b168d3e036 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 22:10:06 -0700 Subject: [PATCH 21/49] Clean up included sampler type section --- docs/Training-Generalized-Reinforcement-Learning-Agents.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/Training-Generalized-Reinforcement-Learning-Agents.md b/docs/Training-Generalized-Reinforcement-Learning-Agents.md index ba8fc8c762..7b87a62de2 100644 --- a/docs/Training-Generalized-Reinforcement-Learning-Agents.md +++ b/docs/Training-Generalized-Reinforcement-Learning-Agents.md @@ -87,9 +87,9 @@ The Sampler Manager allocates a sampler for each `Reset Parameter` by using the which maintains a dictionary mapping of string keys to sampler objects. The available samplers to be used for each `Reset Parameter` is available in the Sampler Factory. -### Possible Sampler Types +### Included Sampler Types -The currently implemented samplers that can be used with the `sampler-type` arguments are: +Below is a list of included `sampler-type` as part of the toolkit. * `uniform` - Uniform sampler * Uniformly samples a single float value between defined endpoints. @@ -117,7 +117,6 @@ The currently implemented samplers that can be used with the `sampler-type` argu * **sub-arguments** - `intervals` - The implementation of the samplers can be found at `ml-agents-envs/mlagents/envs/sampler_class.py`. ### Defining a new sampler method From c641a5be8d21d3b56dd12fe0a3bc8082b68fc3b7 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 22:14:27 -0700 Subject: [PATCH 22/49] cleaned up defining new sampler type in generalization --- ...eneralized-Reinforcement-Learning-Agents.md | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/docs/Training-Generalized-Reinforcement-Learning-Agents.md b/docs/Training-Generalized-Reinforcement-Learning-Agents.md index 7b87a62de2..92c025c780 100644 --- a/docs/Training-Generalized-Reinforcement-Learning-Agents.md +++ b/docs/Training-Generalized-Reinforcement-Learning-Agents.md @@ -119,15 +119,20 @@ Below is a list of included `sampler-type` as part of the toolkit. The implementation of the samplers can be found at `ml-agents-envs/mlagents/envs/sampler_class.py`. -### Defining a new sampler method +### Defining a New Sampler Type -Custom sampling techniques must inherit from the *Sampler* base class (included in the `sampler_class` file) and preserve the interface. Once the class for the required method is specified, it must be registered in the Sampler Factory. +If you want to define your own sampler type, you must first inherit the *Sampler* +base class (included in the `sampler_class` file) and preserve the interface. +Once the class for the required method is specified, it must be registered in the Sampler Factory. -This can be done by subscribing to the *register_sampler* method of the SamplerFactory. The command is as follows: +This can be done by subscribing to the *register_sampler* method of the SamplerFactory. The command +is as follows: `SamplerFactory.register_sampler(*custom_sampler_string_key*, *custom_sampler_object*)` -Once the Sampler Factory reflects the new register, the custom sampler can be used for resampling `Reset Parameter`. 
For demonstration, lets say our sampler was implemented as below, and we register the `CustomSampler` class with the string `custom-sampler` in the Sampler Factory. +Once the Sampler Factory reflects the new register, the new sampler type can be used for sample any +`Reset Parameter`. For example, lets say a new sampler type was implemented as below and we register +the `CustomSampler` class with the string `custom-sampler` in the Sampler Factory. ```python class CustomSampler(Sampler): @@ -139,7 +144,8 @@ class CustomSampler(Sampler): return np.random.choice(self.possible_vals) ``` -Now we need to specify this sampler in the sampler file. Lets say we wish to use this sampler for the `Reset Parameter` *mass*; the sampler file would specify the same for mass as the following (any order of the subarguments is valid). +Now we need to specify the new sampler type in the sampler YAML file. For example, we use this new +sampler type for the `Reset Parameter` *mass*. ```yaml mass: @@ -149,8 +155,6 @@ mass: argC: 3 ``` -With the sampler file setup, we can proceed to train our agent as explained in the next section. - ### Training with Generalization Learning We first begin with setting up the sampler file. After the sampler file is defined and configured, we proceed by launching `mlagents-learn` and specify our configured sampler file with the `--sampler` flag. To demonstrate, if we wanted to train a 3D ball agent with generalization using the `config/3dball_generalize.yaml` sampling setup, we can run From c79392bcada4d0bef48bbb0e709664e9ca1e26b2 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 22:17:47 -0700 Subject: [PATCH 23/49] cleaned up training section of generalization --- ...aining-Generalized-Reinforcement-Learning-Agents.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/docs/Training-Generalized-Reinforcement-Learning-Agents.md b/docs/Training-Generalized-Reinforcement-Learning-Agents.md index 92c025c780..31ecfe15b9 100644 --- a/docs/Training-Generalized-Reinforcement-Learning-Agents.md +++ b/docs/Training-Generalized-Reinforcement-Learning-Agents.md @@ -155,12 +155,16 @@ mass: argC: 3 ``` -### Training with Generalization Learning +### Training with Generalization Using Reset Parameters -We first begin with setting up the sampler file. After the sampler file is defined and configured, we proceed by launching `mlagents-learn` and specify our configured sampler file with the `--sampler` flag. To demonstrate, if we wanted to train a 3D ball agent with generalization using the `config/3dball_generalize.yaml` sampling setup, we can run +After the sampler YAML file is defined, we proceed by launching `mlagents-learn` and specify +our configured sampler file with the `--sampler` flag. For example, if we wanted to train the +3D ball agent with generalization using `Reset Parameters` with `config/3dball_generalize.yaml` +sampling setup, we would run ```sh -mlagents-learn config/trainer_config.yaml --sampler=config/3dball_generalize.yaml --run-id=3D-Ball-generalization --train +mlagents-learn config/trainer_config.yaml --sampler=config/3dball_generalize.yaml +--run-id=3D-Ball-generalization --train ``` We can observe progress and metrics via Tensorboard. 
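For example, assuming the default `summaries` output folder, the TensorBoard dashboard for such a run can be opened with a command along these lines and viewed at `localhost:6006`:

```sh
tensorboard --logdir=summaries --port 6006
```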
From 26321ed9e22f6f86489cafa99c1f3091cb5b2d50 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Mon, 29 Jul 2019 22:30:51 -0700 Subject: [PATCH 24/49] final cleanup for generalization --- ...neralized-Reinforcement-Learning-Agents.md | 27 ++++++++++--------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/docs/Training-Generalized-Reinforcement-Learning-Agents.md b/docs/Training-Generalized-Reinforcement-Learning-Agents.md index 31ecfe15b9..29210781ce 100644 --- a/docs/Training-Generalized-Reinforcement-Learning-Agents.md +++ b/docs/Training-Generalized-Reinforcement-Learning-Agents.md @@ -12,7 +12,7 @@ should be trained over multiple variations of the enviornment. Using this approa for training, the agent will be better suited to adapt (with higher performance) to future unseen variations of the enviornment -_Variations of the 3D Ball environment._ +_Example of variations of the 3D Ball environment._ Ball scale of 0.5 | Ball scale of 4 :-------------------------:|:-------------------------: @@ -21,18 +21,19 @@ Ball scale of 0.5 | Ball scale of 4 ## Introducing Generalization Using Reset Parameters To enable variations in the environments, we implemented `Reset Parameters`. We -also specify a range of values for each `Reset Parameter` and sample -these parameters during training. In the 3D ball environment example displayed +also included different sampling methods and the ability to create new kinds of +sampling methods for each `Reset Parameter`. In the 3D ball environment example displayed in the figure above, the reset parameters are `gravity`, `ball_mass` and `ball_scale`. ## How to Enable Generalization Using Reset Parameters -We need to provide a way to modify the environment by supplying a set of `Reset Parameters`, +We first need to provide a way to modify the environment by supplying a set of `Reset Parameters` and vary them over time. This provision can be done either deterministically or randomly. -This is done by assigning each reset parameter a sampler, which samples a reset -parameter value (such as a uniform sampler). If a sampler isn't provided for a +This is done by assigning each `Reset Parameter` a `sampler-type`(such as a uniform sampler), +which determines how to sample a `Reset +Parameter`. If a `sampler-type` isn't provided for a `Reset Parameter`, the parameter maintains the default value throughout the training procedure, remaining unchanged. The samplers for all the `Reset Parameters` are handled by a **Sampler Manager**, which also handles the generation of new @@ -68,23 +69,23 @@ Below is the explanation of the fields in the above example. train under a particular environment configuration before resetting the environment with a new sample of `Reset Parameters`. -* `parameter_name` - Name of the `Reset Parameter`. This should match the name +* `Reset Parameter` - Name of the `Reset Parameter` like `mass`, `gravity` and `scale`. This should match the name specified in the academy of the intended environment for which the agent is being trained. If a parameter specified in the file doesn't exist in the -environment, then this parameter will be ignored. +environment, then this parameter will be ignored. Within each `Reset Parameter` * `sampler-type` - Specify the sampler type to use for the `Reset Parameter`. This is a string that should exist in the `Sampler Factory` (explained below). - * `sub-arguments` - Specify the sub-arguments depending on the `sampler-type`. 
+ * `sampler-type-sub-arguments` - Specify the sub-arguments depending on the `sampler-type`. In the example above, this would correspond to the `intervals` - under the `sampler-type` `multirange_uniform` for the `Reset Parameter` called gravity`. + under the `sampler-type` `"multirange_uniform"` for the `Reset Parameter` called gravity`. The key name should match the name of the corresponding argument in the sampler definition. (See below) -The Sampler Manager allocates a sampler for each `Reset Parameter` by using the *Sampler Factory*, -which maintains a dictionary mapping of string keys to sampler objects. The available samplers +The Sampler Manager allocates a sampler type for each `Reset Parameter` by using the *Sampler Factory*, +which maintains a dictionary mapping of string keys to sampler objects. The available sampler types to be used for each `Reset Parameter` is available in the Sampler Factory. ### Included Sampler Types @@ -106,7 +107,7 @@ Below is a list of included `sampler-type` as part of the toolkit. * **sub-arguments** - `mean`, `st_dev` -* `multirange_uniform` - Multirange Uniform sampler +* `multirange_uniform` - Multirange uniform sampler * Uniformly samples a single float value between the specified intervals. Samples by first performing a weight pick of an interval from the list of intervals (weighted based on interval width) and samples uniformly From 9bc08609be905d728c1aa4399a92dec8614bad5f Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:14:21 -0700 Subject: [PATCH 25/49] Clean up of Training w Imitation Learning doc --- docs/Training-Imitation-Learning.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/Training-Imitation-Learning.md b/docs/Training-Imitation-Learning.md index 2b834d5a2f..0766d01cd9 100644 --- a/docs/Training-Imitation-Learning.md +++ b/docs/Training-Imitation-Learning.md @@ -12,19 +12,21 @@ from a demonstration to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFY Imitation learning can also be used to help reinforcement learning. Especially in environments with sparse (i.e., infrequent or rare) rewards, the agent may never see -the reward and thus not learn from it. Curiosity helps the agent explore, but in some cases -it is easier to just show the agent how to achieve the reward. In these cases, -imitation learning can dramatically reduce the time it takes to solve the environment. +the reward and thus not learn from it. Curiosity (which is available in the toolkit) +helps the agent explore, but in some cases +it is easier to show the agent how to achieve the reward. In these cases, +imitation learning combined with reinforcement learning can dramatically +reduce the time the agent takes to solve the environment. For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids), -just 6 episodes of demonstrations can reduce training steps by more than 4 times. +using 6 episodes of demonstrations (recorded from a human player) can reduce training steps by more than 4 times. See PreTraining + GAIL + Curiosity + RL below.

+  *(Image: Using Demonstrations with Reinforcement Learning)*
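As a rough illustration of the combined setup referred to above (pre-training plus GAIL and curiosity on top of the extrinsic reward), the relevant part of a `trainer_config.yaml` entry might look something like the sketch below; the brain name, demo path, and the exact pretraining keys are placeholders rather than confirmed parameter names:

```yaml
PyramidsLearning:                     # hypothetical brain name
  reward_signals:
    extrinsic:                        # the usual environment reward
      strength: 1.0
      gamma: 0.99
    curiosity:                        # intrinsic exploration bonus
      strength: 0.02
      gamma: 0.99
    gail:                             # reward for matching recorded demonstrations
      strength: 0.01
      gamma: 0.99
      demo_path: demos/ExpertPyramid.demo   # placeholder path
  pretraining:                        # warm start from the same demonstrations (keys indicative only)
    demo_path: demos/ExpertPyramid.demo     # placeholder path
    strength: 0.5
    steps: 10000
```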

-ML-Agents provides several ways to learn from demonstrations. +The ML-Agents toolkit provides several ways to learn from demonstrations. * To train using GAIL (Generative Adversarial Imitaiton Learning) you can add the [GAIL reward signal](Training-RewardSignals.md#the-gail-reward-signal). GAIL can be From 1290bb7dc97200d23bc0a1bf944144649f461253 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:16:49 -0700 Subject: [PATCH 26/49] updated link for generalization, reordered --- docs/Training-ML-Agents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Training-ML-Agents.md b/docs/Training-ML-Agents.md index a36bfcca71..3e677297dc 100644 --- a/docs/Training-ML-Agents.md +++ b/docs/Training-ML-Agents.md @@ -196,8 +196,8 @@ are conducting, see: * [Training with PPO](Training-PPO.md) * [Using Recurrent Neural Networks](Feature-Memory.md) * [Training with Curriculum Learning](Training-Curriculum-Learning.md) -* [Training with Environment Parameter Sampling](Training-Generalization-Learning.md) * [Training with Imitation Learning](Training-Imitation-Learning.md) +* [Training Generalized Reinforcement Learning Agents](Training-Generalized-Reinforcement-Learning-Agents.md) You can also compare the [example environments](Learning-Environment-Examples.md) From 4b5d05ef325e2bfe4bcfe2a546d0f7250552668f Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:20:12 -0700 Subject: [PATCH 27/49] consistency fix --- docs/Training-ML-Agents.md | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/docs/Training-ML-Agents.md b/docs/Training-ML-Agents.md index 3e677297dc..8b9f2d9f9c 100644 --- a/docs/Training-ML-Agents.md +++ b/docs/Training-ML-Agents.md @@ -97,57 +97,57 @@ In addition to passing the path of the Unity executable containing your training environment, you can set the following command line options when invoking `mlagents-learn`: -* `--env=` - Specify an executable environment to train. -* `--curriculum=` – Specify a curriculum JSON file for defining the +* `--env=`: Specify an executable environment to train. +* `--curriculum=`: Specify a curriculum JSON file for defining the lessons for curriculum training. See [Curriculum Training](Training-Curriculum-Learning.md) for more information. -* `--sampler=` - Specify a sampler YAML file for defining the +* `--sampler=`: Specify a sampler YAML file for defining the sampler for generalization training. See [Generalization Training](Training-Generalization-Learning.md) for more information. -* `--keep-checkpoints=` – Specify the maximum number of model checkpoints to +* `--keep-checkpoints=`: Specify the maximum number of model checkpoints to keep. Checkpoints are saved after the number of steps specified by the `save-freq` option. Once the maximum number of checkpoints has been reached, the oldest checkpoint is deleted when saving a new checkpoint. Defaults to 5. -* `--lesson=` – Specify which lesson to start with when performing curriculum +* `--lesson=`: Specify which lesson to start with when performing curriculum training. Defaults to 0. -* `--load` – If set, the training code loads an already trained model to +* `--load`: If set, the training code loads an already trained model to initialize the neural network before training. The learning code looks for the model in `models//` (which is also where it saves models at the end of training). 
When not set (the default), the neural network weights are randomly initialized and an existing model is not loaded. -* `--num-runs=` - Sets the number of concurrent training sessions to perform. +* `--num-runs=`: Sets the number of concurrent training sessions to perform. Default is set to 1. Set to higher values when benchmarking performance and multiple training sessions is desired. Training sessions are independent, and do not improve learning performance. -* `--run-id=` – Specifies an identifier for each training run. This +* `--run-id=`: Specifies an identifier for each training run. This identifier is used to name the subdirectories in which the trained model and summary statistics are saved as well as the saved model itself. The default id is "ppo". If you use TensorBoard to view the training statistics, always set a unique run-id for each training run. (The statistics for all runs with the same id are combined as if they were produced by a the same session.) -* `--save-freq=` Specifies how often (in steps) to save the model during +* `--save-freq=`: Specifies how often (in steps) to save the model during training. Defaults to 50000. -* `--seed=` – Specifies a number to use as a seed for the random number +* `--seed=`: Specifies a number to use as a seed for the random number generator used by the training code. -* `--slow` – Specify this option to run the Unity environment at normal, game +* `--slow`: Specify this option to run the Unity environment at normal, game speed. The `--slow` mode uses the **Time Scale** and **Target Frame Rate** specified in the Academy's **Inference Configuration**. By default, training runs using the speeds specified in your Academy's **Training Configuration**. See [Academy Properties](Learning-Environment-Design-Academy.md#academy-properties). -* `--train` – Specifies whether to train model or only run in inference mode. +* `--train`: Specifies whether to train model or only run in inference mode. When training, **always** use the `--train` option. -* `--num-envs=` - Specifies the number of concurrent Unity environment instances to collect +* `--num-envs=`: Specifies the number of concurrent Unity environment instances to collect experiences from when training. Defaults to 1. -* `--base-port` - Specifies the starting port. Each concurrent Unity environment instance will get assigned a port sequentially, starting from the `base-port`. Each instance will use the port `(base_port + worker_id)`, where the `worker_id` is sequential IDs given to each instance from 0 to `num_envs - 1`. Default is 5005. -* `--docker-target-name=
` – The Docker Volume on which to store curriculum, +* `--base-port`: Specifies the starting port. Each concurrent Unity environment instance will get assigned a port sequentially, starting from the `base-port`. Each instance will use the port `(base_port + worker_id)`, where the `worker_id` is sequential IDs given to each instance from 0 to `num_envs - 1`. Default is 5005. +* `--docker-target-name=
`: The Docker Volume on which to store curriculum, executable and model files. See [Using Docker](Using-Docker.md). -* `--no-graphics` - Specify this option to run the Unity executable in +* `--no-graphics`: Specify this option to run the Unity executable in `-batchmode` and doesn't initialize the graphics driver. Use this only if your training doesn't involve visual observations (reading from Pixels). See [here](https://docs.unity3d.com/Manual/CommandLineArguments.html) for more details. -* `--debug` - Specify this option to enable debug-level logging for some parts of the code. +* `--debug`: Specify this option to enable debug-level logging for some parts of the code. ### Training config file From 05d85a91e9b763c1ec36b98bac1178d374d2a459 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:37:38 -0700 Subject: [PATCH 28/49] cleaned up training ml agents doc --- docs/Training-ML-Agents.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/Training-ML-Agents.md b/docs/Training-ML-Agents.md index 8b9f2d9f9c..725e211979 100644 --- a/docs/Training-ML-Agents.md +++ b/docs/Training-ML-Agents.md @@ -91,7 +91,7 @@ While this example used the default training hyperparameters, you can edit the [training_config.yaml file](#training-config-file) with a text editor to set different values. -### Command line training options +### Command Line Training Options In addition to passing the path of the Unity executable containing your training environment, you can set the following command line options when invoking @@ -149,7 +149,7 @@ environment, you can set the following command line options when invoking details. * `--debug`: Specify this option to enable debug-level logging for some parts of the code. -### Training config file +### Training Config File The training config files `config/trainer_config.yaml`, `config/online_bc_config.yaml` and `config/offline_bc_config.yaml` specifies the @@ -205,8 +205,9 @@ to the corresponding sections of the `config/trainer_config.yaml` file for each example to see how the hyperparameters and other configuration variables have been changed from the defaults. -### Output metrics -Trainer Metrics are logged to a CSV stored in the `summaries` directory. The metrics stored are: +### Debugging and Profiling +If you enable the `--debug` flag in the command line, the trainer metrics are logged to a CSV file +stored in the `summaries` directory. The metrics stored are: * brain name * time to update policy * time since start of training @@ -216,4 +217,5 @@ Trainer Metrics are logged to a CSV stored in the `summaries` directory. The met This option is not available currently for Behavioral Cloning. -[Profiling](Profiling.md) information is also saved in the `summaries` directory. +Additionally, we have included basic [Python Profiling](Profiling.md) as part of the toolkit. +This information is also saved in the `summaries` directory. 
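To sanity-check these metrics after a run, the CSV can be read back with a few lines of standard-library Python; the file name below is a placeholder for whichever CSV your run wrote to `summaries`:

```python
import csv

# Placeholder path: substitute the metrics CSV produced for your run ID and brain.
METRICS_CSV = "summaries/your-run-id_YourBrain.csv"

with open(METRICS_CSV, newline="") as f:
    reader = csv.reader(f)
    header = next(reader)              # column names, e.g. mean return, policy-update time
    for row in reader:
        print(dict(zip(header, row)))  # one logging interval per row
```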
From a573ef22dc75812084b31c466fbccc273c653a36 Mon Sep 17 00:00:00 2001
From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com>
Date: Tue, 30 Jul 2019 07:38:16 -0700
Subject: [PATCH 29/49] Update and rename Profiling.md to Profiling-Python.md

---
 docs/{Profiling.md => Profiling-Python.md} | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
 rename docs/{Profiling.md => Profiling-Python.md} (98%)

diff --git a/docs/Profiling.md b/docs/Profiling-Python.md
similarity index 98%
rename from docs/Profiling.md
rename to docs/Profiling-Python.md
index 1fc28dd314..8a52ef02ae 100644
--- a/docs/Profiling.md
+++ b/docs/Profiling-Python.md
@@ -1,4 +1,4 @@
-# Profiling ML-Agents in Python
+# Profiling in Python

 ML-Agents provides a lightweight profiling system, in order to identity hotspots in the training process and help spot
 regressions from changes.

From 106ebddc3eb56006a2dffc04dd773d12e3bcfdcf Mon Sep 17 00:00:00 2001
From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com>
Date: Tue, 30 Jul 2019 07:38:53 -0700
Subject: [PATCH 30/49] Updated Python profiling link

---
 docs/Training-ML-Agents.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/Training-ML-Agents.md b/docs/Training-ML-Agents.md
index 725e211979..86c483cb1a 100644
--- a/docs/Training-ML-Agents.md
+++ b/docs/Training-ML-Agents.md
@@ -217,5 +217,5 @@ stored in the `summaries` directory. The metrics stored are:

 This option is not available currently for Behavioral Cloning.

-Additionally, we have included basic [Python Profiling](Profiling.md) as part of the toolkit.
+Additionally, we have included basic [Profiling in Python](Profiling-Python.md) as part of the toolkit.
 This information is also saved in the `summaries` directory.

From 0e59547b6bcae7e265dc6bd1d7ebb1fc05b6f249 Mon Sep 17 00:00:00 2001
From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com>
Date: Tue, 30 Jul 2019 07:40:14 -0700
Subject: [PATCH 31/49] minor clean up in profiling doc

---
 docs/Profiling-Python.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/docs/Profiling-Python.md b/docs/Profiling-Python.md
index 8a52ef02ae..45904b883e 100644
--- a/docs/Profiling-Python.md
+++ b/docs/Profiling-Python.md
@@ -1,7 +1,7 @@
 # Profiling in Python

-ML-Agents provides a lightweight profiling system, in order to identity hotspots in the training process and help spot
-regressions from changes.
+As part of the ML-Agents toolkit, we provide a lightweight profiling system,
+in order to identify hotspots in the training process and help spot regressions from changes.

 Timers are hierarchical, meaning that the time tracked in a block of code can be further split into other blocks if
 desired. This also means that a function that is called from multiple places in the code will appear in multiple
@@ -24,7 +24,6 @@ class TrainerController:

 You can also used the `hierarchical_timer` context manager.

- ``` python with hierarchical_timer("communicator.exchange"): outputs = self.communicator.exchange(step_input) From c58873c339a042cad50680ea0b7dab152bbd2664 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:41:18 -0700 Subject: [PATCH 32/49] Rename Training-BehavioralCloning.md to Training-Behavioral-Cloning.md --- ...aining-BehavioralCloning.md => Training-Behavioral-Cloning.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/{Training-BehavioralCloning.md => Training-Behavioral-Cloning.md} (100%) diff --git a/docs/Training-BehavioralCloning.md b/docs/Training-Behavioral-Cloning.md similarity index 100% rename from docs/Training-BehavioralCloning.md rename to docs/Training-Behavioral-Cloning.md From 1eeb52e883d50c021c40e90496caa8eac1a34bbe Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:41:54 -0700 Subject: [PATCH 33/49] Updated link to BC --- docs/Training-Imitation-Learning.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Training-Imitation-Learning.md b/docs/Training-Imitation-Learning.md index 0766d01cd9..7c41eac733 100644 --- a/docs/Training-Imitation-Learning.md +++ b/docs/Training-Imitation-Learning.md @@ -36,7 +36,7 @@ The ML-Agents toolkit provides several ways to learn from demonstrations. [pretraining](Training-PPO.md#optional-pretraining-using-demonstrations) on the PPO trainer, in addition to using a small GAIL reward signal. * To train an agent to exactly mimic demonstrations, you can use the - [Behavioral Cloning](Training-BehavioralCloning.md) trainer. Behavioral Cloning can be + [Behavioral Cloning](Training-Behavioral-Cloning.md) trainer. Behavioral Cloning can be used offline and online (in-editor), and learns very quickly. However, it usually is ineffective on more complex environments without a large number of demonstrations. From 468607059f3cc911913fcffda013e15c83f23072 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:43:53 -0700 Subject: [PATCH 34/49] Rename Training-RewardSignals.md to Reward-Signals.md --- docs/{Training-RewardSignals.md => Reward-Signals.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/{Training-RewardSignals.md => Reward-Signals.md} (100%) diff --git a/docs/Training-RewardSignals.md b/docs/Reward-Signals.md similarity index 100% rename from docs/Training-RewardSignals.md rename to docs/Reward-Signals.md From d8420043564be50bf9da016c7ed9ec1392c759b5 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:44:36 -0700 Subject: [PATCH 35/49] fix reward links to new --- docs/Training-PPO.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Training-PPO.md b/docs/Training-PPO.md index d6cdddbd85..12e03c4758 100644 --- a/docs/Training-PPO.md +++ b/docs/Training-PPO.md @@ -8,7 +8,7 @@ ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate Python process (communicating with the running Unity application over a socket). To train an agent, you will need to provide the agent one or more reward signals which -the agent should attempt to maximize. See [Reward Signals](Training-RewardSignals.md) +the agent should attempt to maximize. See [Reward Signals](Reward-Signals.md) for the available reward signals and the corresponding hyperparameters. 
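
To make "one or more reward signals" concrete, the smallest useful configuration enables only the default extrinsic signal. A minimal sketch of that portion of the trainer config, using the documented default values, looks like this:

```yaml
reward_signals:
  extrinsic:
    strength: 1.0   # multiplier applied to the reward coming from the environment
    gamma: 0.99     # discount factor for future extrinsic rewards
```

Entries such as `curiosity` or `gail` can be added alongside `extrinsic` to mix several signals, as described on the Reward Signals page.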
See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the From dc9dbb4feccdf4e32f1c66a5a89c6aec21ad6f42 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:46:37 -0700 Subject: [PATCH 36/49] cleaned up reward signal language --- docs/Training-PPO.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/Training-PPO.md b/docs/Training-PPO.md index 12e03c4758..758b4c4871 100644 --- a/docs/Training-PPO.md +++ b/docs/Training-PPO.md @@ -42,10 +42,10 @@ rewarding the agent for various different behaviors. For instance, we could rewa the agent for exploring new states, rather than just when an explicit reward is given. Furthermore, we could mix reward signals to help the learning process. -`reward_signals` provides a section to define [reward signals.](Training-RewardSignals.md) -ML-Agents provides two reward signals by default, the Extrinsic (environment) reward, and the +The hyperparameter `reward_signals` allows you to define [reward signals.](Training-RewardSignals.md) +The ML-Agents toolkit provides two reward signals by default, the Extrinsic (environment) reward and the Curiosity reward, which can be used to encourage exploration in sparse extrinsic reward -environments. +environments. Please see [Reward Signals](Training-RewardSignals.md) for additional details.. ### Lambda From 15ed7f1f80fa81fd95d3e8c1d6a7cf1921ba2137 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:47:47 -0700 Subject: [PATCH 37/49] fixed broken links to reward signals --- docs/Training-PPO.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/Training-PPO.md b/docs/Training-PPO.md index 758b4c4871..9babf08430 100644 --- a/docs/Training-PPO.md +++ b/docs/Training-PPO.md @@ -42,10 +42,10 @@ rewarding the agent for various different behaviors. For instance, we could rewa the agent for exploring new states, rather than just when an explicit reward is given. Furthermore, we could mix reward signals to help the learning process. -The hyperparameter `reward_signals` allows you to define [reward signals.](Training-RewardSignals.md) +Using `reward_signals` allows you to define [reward signals.](Reward-Signals.md) The ML-Agents toolkit provides two reward signals by default, the Extrinsic (environment) reward and the Curiosity reward, which can be used to encourage exploration in sparse extrinsic reward -environments. Please see [Reward Signals](Training-RewardSignals.md) for additional details.. +environments. Please see [Reward Signals](Reward-Signals.md) for additional details.. ### Lambda From 2ca95ae0e0971aa4d9940574dd66e930e5c4c86b Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:48:07 -0700 Subject: [PATCH 38/49] consistency fix --- docs/Training-PPO.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Training-PPO.md b/docs/Training-PPO.md index 9babf08430..e3209949d6 100644 --- a/docs/Training-PPO.md +++ b/docs/Training-PPO.md @@ -25,7 +25,7 @@ Learning](Training-Curriculum-Learning.md). For information about imitation learning from demonstrations, see [Training with Imitation Learning](Training-Imitation-Learning.md). -## Best Practices when training with PPO +## Best Practices Training with PPO Successfully training a Reinforcement Learning model often involves tuning the training hyperparameters. 
This guide contains some best practices for tuning the From 54a55a2783b7dbc087930e6f4484d945b22146db Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 07:58:22 -0700 Subject: [PATCH 39/49] Updated readme with generalization --- docs/Readme.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/Readme.md b/docs/Readme.md index fdad80e4f5..f85ae59d80 100644 --- a/docs/Readme.md +++ b/docs/Readme.md @@ -39,6 +39,7 @@ * [Training with Curriculum Learning](Training-Curriculum-Learning.md) * [Training with Imitation Learning](Training-Imitation-Learning.md) * [Training with LSTM](Feature-Memory.md) +* [Training Generalized Reinforcement Learning Agents](Training-Generalized-Reinforcement-Learning-Agents.md) * [Training on the Cloud with Amazon Web Services](Training-on-Amazon-Web-Service.md) * [Training on the Cloud with Microsoft Azure](Training-on-Microsoft-Azure.md) * [Training Using Concurrent Unity Instances](Training-Using-Concurrent-Unity-Instances.md) From e1f3faef456b2ac44de0f893de248b60a6f97bad Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 10:04:44 -0700 Subject: [PATCH 40/49] Added example for GAIL reward signal --- docs/Reward-Signals.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/docs/Reward-Signals.md b/docs/Reward-Signals.md index 2f62402f1b..f1ee1c5174 100644 --- a/docs/Reward-Signals.md +++ b/docs/Reward-Signals.md @@ -18,9 +18,9 @@ The `curiosity` reward signal helps your agent explore when extrinsic rewards ar ## Enabling Reward Signals Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. An -example is provided in `config/trainer_config.yaml`. To enable a reward signal, add it to the +example is provided in `config/trainer_config.yaml` and `config/gail_config.yaml`. To enable a reward signal, add it to the `reward_signals:` section under the brain name. For instance, to enable the extrinsic signal -in addition to a small curiosity reward, you would define your `reward_signals` as follows: +in addition to a small curiosity reward and a GAIL reward signal, you would define your `reward_signals` as follows: ```yaml reward_signals: @@ -28,9 +28,14 @@ reward_signals: strength: 1.0 gamma: 0.99 curiosity: + strength: 0.02 + gamma: 0.99 + encoding_size: 256 + gail: strength: 0.01 gamma: 0.99 encoding_size: 128 + demo_path: demos/ExpertPyramid.demo ``` Each reward signal should define at least two parameters, `strength` and `gamma`, in addition From a3ff2b67c3b187503260ed91a08fce0a9ad7ddb0 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 11:36:32 -0700 Subject: [PATCH 41/49] minor fixes and consistency to Reward Signals --- docs/Reward-Signals.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/docs/Reward-Signals.md b/docs/Reward-Signals.md index f1ee1c5174..b6d1f55b52 100644 --- a/docs/Reward-Signals.md +++ b/docs/Reward-Signals.md @@ -44,8 +44,9 @@ its entry entirely from `reward_signals`. At least one reward signal should be l at all times. ## Reward Signal Types +As part of the toolkit, we provide three reward signal types as part of hyperparameters - Extrinsic, Curiosity, and GAIL. -### The Extrinsic Reward Signal +### Extrinsic Reward Signal The `extrinsic` reward signal is simply the reward given by the [environment](Learning-Environment-Design.md). 
Remove it to force the agent @@ -68,9 +69,9 @@ cases when rewards are more immediate, it can be smaller. Typical Range: `0.8` - `0.995` -### The Curiosity Reward Signal +### Curiosity Reward Signal -The `curiosity` Reward Signal enables the Intrinsic Curiosity Module. This is an implementation +The `curiosity` reward signal enables the Intrinsic Curiosity Module. This is an implementation of the approach described in "Curiosity-driven Exploration by Self-supervised Prediction" by Pathak, et al. It trains two networks: * an inverse model, which takes the current and next obersvation of the agent, encodes them, and @@ -120,12 +121,12 @@ Default Value: `3e-4` Typical Range: `1e-5` - `1e-3` -### The GAIL Reward Signal +### GAIL Reward Signal GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), is an imitation learning algorithm that uses an adversarial approach, in a similar vein to GANs (Generative Adversarial Networks). In this framework, a second neural network, the -discriminator, is taught to distinguish whether an observation/action is from a demonstration, or +discriminator, is taught to distinguish whether an observation/action is from a demonstration or produced by the agent. This discriminator can the examine a new observation/action and provide it a reward based on how close it believes this new observation/action is to the provided demonstrations. @@ -136,9 +137,9 @@ discriminator keeps getting stricter and stricter and the agent must try harder This approach, when compared to [Behavioral Cloning](Training-BehavioralCloning.md), requires far fewer demonstrations to be provided. After all, we are still learning a policy that happens -to be similar to the demonstration, not directly copying the behavior of the demonstrations. It -is also especially effective when combined with an Extrinsic signal, but can also be used -independently to purely learn from demonstration. +to be similar to the demonstrations, not directly copying the behavior of the demonstrations. It +is especially effective when combined with an Extrinsic signal. However, the GAIL reward signal can +also be used independently to purely learn from demonstrations. Using GAIL requires recorded demonstrations from your Unity environment. See the [imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations. From ac4e629d552aa8b960ab7cc18973af1b64ba7da3 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 11:37:15 -0700 Subject: [PATCH 42/49] referencing GAIL in the recording demonstration --- docs/Training-Imitation-Learning.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Training-Imitation-Learning.md b/docs/Training-Imitation-Learning.md index 7c41eac733..7838ca7b17 100644 --- a/docs/Training-Imitation-Learning.md +++ b/docs/Training-Imitation-Learning.md @@ -60,7 +60,7 @@ It is possible to record demonstrations of agent behavior from the Unity Editor, and save them as assets. These demonstrations contain information on the observations, actions, and rewards for a given agent during the recording session. They can be managed from the Editor, as well as used for training with Offline -Behavioral Cloning (see below). +Behavioral Cloning and GAIL. In order to record demonstrations from an agent, add the `Demonstration Recorder` component to a GameObject in the scene which contains an `Agent` component. 
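
As a practical note, the `.demo` asset saved by the Demonstration Recorder is exactly what GAIL consumes: its location is passed through the `demo_path` hyperparameter of the `gail` reward signal. A rough sketch, reusing the demonstration file name from the earlier example (substitute the path of the file you actually recorded):

```yaml
reward_signals:
  gail:
    strength: 0.01
    gamma: 0.99
    demo_path: demos/ExpertPyramid.demo   # the recorded demonstration asset
```

The same file can also be handed to the `pretraining` section of the PPO trainer or to Behavioral Cloning.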
From 59b8bc6cb603646685f3e5ef518bef0517ef43d9 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 13:05:37 -0700 Subject: [PATCH 43/49] consistency --- docs/Learning-Environment-Examples.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/Learning-Environment-Examples.md b/docs/Learning-Environment-Examples.md index f30f9d2aa8..4930ecf291 100644 --- a/docs/Learning-Environment-Examples.md +++ b/docs/Learning-Environment-Examples.md @@ -194,8 +194,8 @@ If you would like to contribute environments, please see our * Rotation (3 possible actions: Rotate Left, Rotate Right, No Action) * Side Motion (3 possible actions: Left, Right, No Action) * Jump (2 possible actions: Jump, No Action) - * Visual Observations: None. -* Reset Parameters: Four. + * Visual Observations: None +* Reset Parameters: Four * Benchmark Mean Reward (Big & Small Wall Brain): 0.8 ## [Reacher](https://youtu.be/2N9EoF6pQyE) From 824490d2260f7e4eb595b16305030a3283597124 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 14:40:00 -0700 Subject: [PATCH 44/49] fixed desc of bc and gail --- docs/ML-Agents-Overview.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/ML-Agents-Overview.md b/docs/ML-Agents-Overview.md index 2ac2b14c37..46cb44d0cd 100644 --- a/docs/ML-Agents-Overview.md +++ b/docs/ML-Agents-Overview.md @@ -319,11 +319,11 @@ imitation learning algorithm will then use these pairs of observations and actions from the human player to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFYs). -The toolkit provides a way for agents to learn directly from demonstrations using -the Behavioral Cloning algorithm. The toolkit also enables use of demonstrations -to help speed up reward-based (RL) training using the Generative Adversarial -Imitation Learning (GAIL) algorithm. The [Training with Imitation Learning](Training-Imitation-Learning.md) -tutorial covers these features in more depth. +The toolkit provides a way to learn directly from demonstrations but also use these +demonstrations to help speed up reward-based training (RL). We include two algorithms called +Behavioral Cloning (BC) and Generative Adversarial Imitation Learning (GAIL). The +[Training with Imitation Learning](Training-Imitation-Learning.md) tutorial covers these +features in more depth. ## Flexible Training Scenarios From 9c676c027b13fd54dce6ec4c01fdf01c0d621407 Mon Sep 17 00:00:00 2001 From: Jeffrey Shih <34355042+unityjeffrey@users.noreply.github.com> Date: Tue, 30 Jul 2019 15:10:36 -0700 Subject: [PATCH 45/49] comment fix --- docs/Training-Imitation-Learning.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/Training-Imitation-Learning.md b/docs/Training-Imitation-Learning.md index 7838ca7b17..aab4d8f524 100644 --- a/docs/Training-Imitation-Learning.md +++ b/docs/Training-Imitation-Learning.md @@ -18,7 +18,8 @@ it is easier to show the agent how to achieve the reward. In these cases, imitation learning combined with reinforcement learning can dramatically reduce the time the agent takes to solve the environment. For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids), -using 6 episodes of demonstrations (recorded from a human player) can reduce training steps by more than 4 times. See PreTraining + GAIL + Curiosity + RL below. +using 6 episodes of demonstrations can reduce training steps by more than 4 times. 
+See PreTraining + GAIL + Curiosity + RL below.

Date: Tue, 30 Jul 2019 15:12:12 -0700 Subject: [PATCH 46/49] comments fix --- docs/Training-PPO.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/Training-PPO.md b/docs/Training-PPO.md index e3209949d6..ad62a72ec3 100644 --- a/docs/Training-PPO.md +++ b/docs/Training-PPO.md @@ -43,9 +43,10 @@ the agent for exploring new states, rather than just when an explicit reward is Furthermore, we could mix reward signals to help the learning process. Using `reward_signals` allows you to define [reward signals.](Reward-Signals.md) -The ML-Agents toolkit provides two reward signals by default, the Extrinsic (environment) reward and the -Curiosity reward, which can be used to encourage exploration in sparse extrinsic reward -environments. Please see [Reward Signals](Reward-Signals.md) for additional details.. +The ML-Agents toolkit provides three reward signals by default, the Extrinsic (environment) +reward signal, the Curiosity reward signal, which can be used to encourage exploration in +sparse extrinsic reward environments, and the GAIL reward signal. Please see [Reward Signals](Reward-Signals.md) +for additional details. ### Lambda From ae6ca4256f2c077dc68f4cd2d99874ef3f2d3e89 Mon Sep 17 00:00:00 2001 From: Ervin Teng Date: Tue, 30 Jul 2019 17:37:20 -0700 Subject: [PATCH 47/49] Fix broken links --- docs/Migrating.md | 2 +- docs/Reward-Signals.md | 80 ++++++++++++++--------------- docs/Training-Imitation-Learning.md | 8 +-- docs/Training-ML-Agents.md | 10 ++-- docs/Training-PPO.md | 24 ++++----- 5 files changed, 62 insertions(+), 62 deletions(-) diff --git a/docs/Migrating.md b/docs/Migrating.md index 244e803afc..9ab3dafce6 100644 --- a/docs/Migrating.md +++ b/docs/Migrating.md @@ -16,7 +16,7 @@ from the top-level config and follow the steps below: * `use_curiosity`, `curiosity_strength`, `curiosity_enc_size`: Define a `curiosity` reward signal and set its `strength` to `curiosity_strength`, and `encoding_size` to `curiosity_enc_size`. Give it the same `gamma` as your `extrinsic` signal to mimic previous behavior. -See [Reward Signals](Training-RewardSignals.md) for more information on defining reward signals. +See [Reward Signals](Reward-Signals.md) for more information on defining reward signals. * TensorBoards generated when running multiple environments in v0.8 are not comparable to those generated in v0.9 in terms of step count. Multiply your v0.8 step count by `num_envs` for an approximate comparison. You may need to change `max_steps` in your config as appropriate as well. diff --git a/docs/Reward-Signals.md b/docs/Reward-Signals.md index b6d1f55b52..5e1e65a010 100644 --- a/docs/Reward-Signals.md +++ b/docs/Reward-Signals.md @@ -71,8 +71,8 @@ Typical Range: `0.8` - `0.995` ### Curiosity Reward Signal -The `curiosity` reward signal enables the Intrinsic Curiosity Module. This is an implementation -of the approach described in "Curiosity-driven Exploration by Self-supervised Prediction" +The `curiosity` reward signal enables the Intrinsic Curiosity Module. This is an implementation +of the approach described in "Curiosity-driven Exploration by Self-supervised Prediction" by Pathak, et al. 
It trains two networks: * an inverse model, which takes the current and next obersvation of the agent, encodes them, and uses the encoding to predict the action that was taken between the observations @@ -86,11 +86,11 @@ For more information, see * https://pathak22.github.io/noreward-rl/ * https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/ -#### Strength +#### Strength -In this case, `strength` corresponds to the magnitude of the curiosity reward generated -by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough -to not be overwhelmed by extrinsic reward signals in the environment. +In this case, `strength` corresponds to the magnitude of the curiosity reward generated +by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough +to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. Typical Range: `0.001` - `0.1` @@ -114,48 +114,48 @@ Typical Range: `64` - `256` #### Learning Rate -`learning_rate` is the learning rate used to update the intrinsic curiosity module. +`learning_rate` is the learning rate used to update the intrinsic curiosity module. This should typically be decreased if training is unstable, and the curiosity loss is unstable. Default Value: `3e-4` -Typical Range: `1e-5` - `1e-3` +Typical Range: `1e-5` - `1e-3` ### GAIL Reward Signal -GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), is an -imitation learning algorithm that uses an adversarial approach, in a similar vein to GANs +GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), is an +imitation learning algorithm that uses an adversarial approach, in a similar vein to GANs (Generative Adversarial Networks). In this framework, a second neural network, the -discriminator, is taught to distinguish whether an observation/action is from a demonstration or -produced by the agent. This discriminator can the examine a new observation/action and provide it a -reward based on how close it believes this new observation/action is to the provided demonstrations. +discriminator, is taught to distinguish whether an observation/action is from a demonstration or +produced by the agent. This discriminator can the examine a new observation/action and provide it a +reward based on how close it believes this new observation/action is to the provided demonstrations. -At each training step, the agent tries to learn how to maximize this reward. Then, the -discriminator is trained to better distinguish between demonstrations and agent state/actions. +At each training step, the agent tries to learn how to maximize this reward. Then, the +discriminator is trained to better distinguish between demonstrations and agent state/actions. In this way, while the agent gets better and better at mimicing the demonstrations, the -discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it. +discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it. -This approach, when compared to [Behavioral Cloning](Training-BehavioralCloning.md), requires +This approach, when compared to [Behavioral Cloning](Training-Behavioral-Cloning.md), requires far fewer demonstrations to be provided. After all, we are still learning a policy that happens to be similar to the demonstrations, not directly copying the behavior of the demonstrations. 
It -is especially effective when combined with an Extrinsic signal. However, the GAIL reward signal can -also be used independently to purely learn from demonstrations. +is especially effective when combined with an Extrinsic signal. However, the GAIL reward signal can +also be used independently to purely learn from demonstrations. -Using GAIL requires recorded demonstrations from your Unity environment. See the +Using GAIL requires recorded demonstrations from your Unity environment. See the [imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations. -#### Strength +#### Strength `strength` is the factor by which to multiply the raw reward. Note that when using GAIL -with an Extrinsic Signal, this value should be set lower if your demonstrations are -suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic -rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. +with an Extrinsic Signal, this value should be set lower if your demonstrations are +suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic +rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. Typical Range: `0.01` - `1.0` #### Gamma -`gamma` corresponds to the discount factor for future rewards. +`gamma` corresponds to the discount factor for future rewards. Typical Range: `0.8` - `0.9` @@ -166,11 +166,11 @@ Typical Range: `0.8` - `0.9` #### Encoding Size -`encoding_size` corresponds to the size of the hidden layer used by the discriminator. +`encoding_size` corresponds to the size of the hidden layer used by the discriminator. This value should be small enough to encourage the discriminator to compress the original -observation, but also not too small to prevent it from learning to differentiate between +observation, but also not too small to prevent it from learning to differentiate between demonstrated and actual behavior. Dramatically increasing this size will also negatively affect -training times. +training times. Default Value: `64` @@ -178,29 +178,29 @@ Typical Range: `64` - `256` #### Learning Rate -`learning_rate` is the learning rate used to update the discriminator. +`learning_rate` is the learning rate used to update the discriminator. This should typically be decreased if training is unstable, and the GAIL loss is unstable. Default Value: `3e-4` -Typical Range: `1e-5` - `1e-3` +Typical Range: `1e-5` - `1e-3` #### Use Actions -`use_actions` determines whether the discriminator should discriminate based on both +`use_actions` determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to `True` if you want the agent to mimic the actions from the demonstrations, and `False` if you'd rather have the agent -visit the same states as in the demonstrations but with possibly different actions. +visit the same states as in the demonstrations but with possibly different actions. Setting to `False` is more likely to be stable, especially with imperfect demonstrations, -but may learn slower. +but may learn slower. Default Value: `false` #### (Optional) Samples Per Update -`samples_per_update` is the maximum number of samples to use during each discriminator update. You may -want to lower this if your buffer size is very large to avoid overfitting the discriminator on current data. 
-If set to 0, we will use the minimum of buffer size and the number of demonstration samples. +`samples_per_update` is the maximum number of samples to use during each discriminator update. You may +want to lower this if your buffer size is very large to avoid overfitting the discriminator on current data. +If set to 0, we will use the minimum of buffer size and the number of demonstration samples. Default Value: `0` @@ -208,10 +208,10 @@ Typical Range: Approximately equal to [`buffer_size`](Training-PPO.md) #### (Optional) Variational Discriminator Bottleneck -`use_vail` enables a [variational bottleneck](https://arxiv.org/abs/1810.00821) within the -GAIL discriminator. This forces the discriminator to learn a more general representation -and reduces its tendency to be "too good" at discriminating, making learning more stable. +`use_vail` enables a [variational bottleneck](https://arxiv.org/abs/1810.00821) within the +GAIL discriminator. This forces the discriminator to learn a more general representation +and reduces its tendency to be "too good" at discriminating, making learning more stable. However, it does increase training time. Enable this if you notice your imitation learning is -unstable, or unable to learn the task at hand. +unstable, or unable to learn the task at hand. Default Value: `false` diff --git a/docs/Training-Imitation-Learning.md b/docs/Training-Imitation-Learning.md index aab4d8f524..679568a339 100644 --- a/docs/Training-Imitation-Learning.md +++ b/docs/Training-Imitation-Learning.md @@ -1,4 +1,4 @@ -# Imitation Learning +# Training with Imitation Learning It is often more intuitive to simply demonstrate the behavior we want an agent to perform, rather than attempting to have it learn via trial-and-error methods. @@ -12,10 +12,10 @@ from a demonstration to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFY Imitation learning can also be used to help reinforcement learning. Especially in environments with sparse (i.e., infrequent or rare) rewards, the agent may never see -the reward and thus not learn from it. Curiosity (which is available in the toolkit) +the reward and thus not learn from it. Curiosity (which is available in the toolkit) helps the agent explore, but in some cases it is easier to show the agent how to achieve the reward. In these cases, -imitation learning combined with reinforcement learning can dramatically +imitation learning combined with reinforcement learning can dramatically reduce the time the agent takes to solve the environment. For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids), using 6 episodes of demonstrations can reduce training steps by more than 4 times. @@ -30,7 +30,7 @@ See PreTraining + GAIL + Curiosity + RL below. The ML-Agents toolkit provides several ways to learn from demonstrations. * To train using GAIL (Generative Adversarial Imitaiton Learning) you can add the - [GAIL reward signal](Training-RewardSignals.md#the-gail-reward-signal). GAIL can be + [GAIL reward signal](Reward-Signals.md#the-gail-reward-signal). GAIL can be used with or without environment rewards, and works well when there are a limited number of demonstrations. 
* To help bootstrap reinforcement learning, you can enable diff --git a/docs/Training-ML-Agents.md b/docs/Training-ML-Agents.md index 86c483cb1a..5a9a749119 100644 --- a/docs/Training-ML-Agents.md +++ b/docs/Training-ML-Agents.md @@ -103,7 +103,7 @@ environment, you can set the following command line options when invoking Training](Training-Curriculum-Learning.md) for more information. * `--sampler=`: Specify a sampler YAML file for defining the sampler for generalization training. See [Generalization - Training](Training-Generalization-Learning.md) for more information. + Training](Training-Generalized-Reinforcement-Learning-Agents.md) for more information. * `--keep-checkpoints=`: Specify the maximum number of model checkpoints to keep. Checkpoints are saved after the number of steps specified by the `save-freq` option. Once the maximum number of checkpoints has been reached, @@ -180,7 +180,7 @@ environments are included in the provided config file. | num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO | | num_layers | The number of hidden layers in the neural network. | PPO, BC | | pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO | -| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Training-RewardSignals.md) for configuration options. | PPO | +| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO | | sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC | | summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, BC | | time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, (online)BC | @@ -206,15 +206,15 @@ example to see how the hyperparameters and other configuration variables have been changed from the defaults. ### Debugging and Profiling -If you enable the `--debug` flag in the command line, the trainer metrics are logged to a CSV file +If you enable the `--debug` flag in the command line, the trainer metrics are logged to a CSV file stored in the `summaries` directory. The metrics stored are: * brain name * time to update policy * time since start of training * time for last experience collection * number of experiences used for training - * mean return - + * mean return + This option is not available currently for Behavioral Cloning. Additionally, we have included basic [Profiling in Python](Profiling-Python.md) as part of the toolkit. diff --git a/docs/Training-PPO.md b/docs/Training-PPO.md index ad62a72ec3..a2bd53844b 100644 --- a/docs/Training-PPO.md +++ b/docs/Training-PPO.md @@ -44,7 +44,7 @@ Furthermore, we could mix reward signals to help the learning process. 
Using `reward_signals` allows you to define [reward signals.](Reward-Signals.md) The ML-Agents toolkit provides three reward signals by default, the Extrinsic (environment) -reward signal, the Curiosity reward signal, which can be used to encourage exploration in +reward signal, the Curiosity reward signal, which can be used to encourage exploration in sparse extrinsic reward environments, and the GAIL reward signal. Please see [Reward Signals](Reward-Signals.md) for additional details. @@ -172,10 +172,10 @@ Typical Range: `32` - `512` `vis_encode_type` corresponds to the encoder type for encoding visual observations. Valid options include: * `simple` (default): a simple encoder which consists of two convolutional layers -* `nature_cnn`: CNN implementation proposed by Mnih et al.(https://www.nature.com/articles/nature14236), +* `nature_cnn`: CNN implementation proposed by Mnih et al.(https://www.nature.com/articles/nature14236), consisting of three convolutional layers * `resnet`: IMPALA Resnet implementation (https://arxiv.org/abs/1802.01561), -consisting of three stacked layers, each with two risidual blocks, making a +consisting of three stacked layers, each with two risidual blocks, making a much larger network than the other two. Options: `simple`, `nature_cnn`, `resnet` @@ -207,9 +207,9 @@ Typical Range: `64` - `512` ## (Optional) Pretraining Using Demonstrations In some cases, you might want to bootstrap the agent's policy using behavior recorded -from a player. This can help guide the agent towards the reward. Pretraining adds -training operations that mimic a demonstration rather than attempting to maximize reward. -It is essentially equivalent to running [behavioral cloning](./Training-BehavioralCloning.md) +from a player. This can help guide the agent towards the reward. Pretraining adds +training operations that mimic a demonstration rather than attempting to maximize reward. +It is essentially equivalent to running [behavioral cloning](Training-Behavioral-Cloning.md) in-line with PPO. To use pretraining, add a `pretraining` section to the trainer_config. For instance: @@ -227,22 +227,22 @@ Below are the avaliable hyperparameters for pretraining. `strength` corresponds to the learning rate of the imitation relative to the learning rate of PPO, and roughly corresponds to how strongly we allow the behavioral cloning -to influence the policy. +to influence the policy. Typical Range: `0.1` - `0.5` ### Demo Path -`demo_path` is the path to your `.demo` file or directory of `.demo` files. +`demo_path` is the path to your `.demo` file or directory of `.demo` files. See the [imitation learning guide](Training-Imitation-Learning.md) for more on `.demo` files. ### Steps -During pretraining, it is often desirable to stop using demonstrations after the agent has +During pretraining, it is often desirable to stop using demonstrations after the agent has "seen" rewards, and allow it to optimize past the available demonstrations and/or generalize outside of the provided demonstrations. `steps` corresponds to the training steps over which -pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set -the steps to 0 for constant imitation over the entire training run. +pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set +the steps to 0 for constant imitation over the entire training run. 
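
Putting the fields above together with the optional ones covered in the next few sections, a fuller `pretraining` block might look like the following sketch. Every value is illustrative rather than a recommendation, and `demo_path` should point at your own recorded `.demo` file.

```yaml
pretraining:
  demo_path: demos/ExpertPyramid.demo
  strength: 0.5           # cloning learning rate relative to PPO's learning rate
  steps: 10000            # anneal the pretraining influence over this many steps
  batch_size: 512         # optional; defaults to the PPO batch_size
  num_epoch: 3            # optional; defaults to the PPO num_epoch
  samples_per_update: 0   # optional; 0 means train over all demonstration samples each update
```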
### (Optional) Batch Size @@ -264,7 +264,7 @@ Typical Range: `3` - `10` `samples_per_update` is the maximum number of samples to use during each imitation update. You may want to lower this if your demonstration -dataset is very large to avoid overfitting the policy on demonstrations. Set to 0 +dataset is very large to avoid overfitting the policy on demonstrations. Set to 0 to train over all of the demonstrations at each update step. Default Value: `0` (all) From 82871c4aae5f12140ca058be00de032086193ce7 Mon Sep 17 00:00:00 2001 From: Ervin Teng Date: Tue, 30 Jul 2019 17:38:41 -0700 Subject: [PATCH 48/49] Fix grammar in Overview for IL --- docs/ML-Agents-Overview.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/ML-Agents-Overview.md b/docs/ML-Agents-Overview.md index 46cb44d0cd..f194c64ef0 100644 --- a/docs/ML-Agents-Overview.md +++ b/docs/ML-Agents-Overview.md @@ -319,8 +319,8 @@ imitation learning algorithm will then use these pairs of observations and actions from the human player to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFYs). -The toolkit provides a way to learn directly from demonstrations but also use these -demonstrations to help speed up reward-based training (RL). We include two algorithms called +The toolkit provides a way to learn directly from demonstrations, as well as use them +to help speed up reward-based training (RL). We include two algorithms called Behavioral Cloning (BC) and Generative Adversarial Imitation Learning (GAIL). The [Training with Imitation Learning](Training-Imitation-Learning.md) tutorial covers these features in more depth. @@ -415,7 +415,7 @@ training process. a way to randomly sample Reset Parameters of the environment during training. See [Training Generalized Reinforcement Learning Agents](Training-Generalized-Reinforcement-Learning-Agents.md) to learn more about this feature. - + - **Broadcasting** - As discussed earlier, a Learning Brain sends the observations for all its Agents to the Python API when dragged into the Academy's `Broadcast Hub` with the `Control` checkbox checked. This is helpful From 9dd5f181291ed72f17ad11e2e5b0bda95bcb970f Mon Sep 17 00:00:00 2001 From: Ervin Teng Date: Tue, 30 Jul 2019 10:54:10 -0700 Subject: [PATCH 49/49] Add optional params to reward signals comment to GAIL --- docs/Reward-Signals.md | 43 +++++++++++++------ .../components/reward_signals/gail/signal.py | 1 + 2 files changed, 32 insertions(+), 12 deletions(-) diff --git a/docs/Reward-Signals.md b/docs/Reward-Signals.md index 5e1e65a010..0b44185766 100644 --- a/docs/Reward-Signals.md +++ b/docs/Reward-Signals.md @@ -101,7 +101,7 @@ Typical Range: `0.001` - `0.1` Typical Range: `0.8` - `0.995` -#### Encoding Size +#### (Optional) Encoding Size `encoding_size` corresponds to the size of the encoding used by the intrinsic curiosity model. This value should be small enough to encourage the ICM to compress the original @@ -112,7 +112,7 @@ Default Value: `64` Typical Range: `64` - `256` -#### Learning Rate +#### (Optional) Learning Rate `learning_rate` is the learning rate used to update the intrinsic curiosity module. This should typically be decreased if training is unstable, and the curiosity loss is unstable. @@ -121,6 +121,15 @@ Default Value: `3e-4` Typical Range: `1e-5` - `1e-3` +#### (Optional) Num Epochs + +`num_epoch` The number of passes to make through the experience buffer when performing gradient +descent optimization for the ICM. This typically should be set to the same as used for PPO. 
+ +Default Value: `3` + +Typical Range: `3` - `10` + ### GAIL Reward Signal GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), is an @@ -164,7 +173,7 @@ Typical Range: `0.8` - `0.9` `demo_path` is the path to your `.demo` file or directory of `.demo` files. See the [imitation learning guide] (Training-Imitation-Learning.md). -#### Encoding Size +#### (Optional) Encoding Size `encoding_size` corresponds to the size of the hidden layer used by the discriminator. This value should be small enough to encourage the discriminator to compress the original @@ -176,7 +185,7 @@ Default Value: `64` Typical Range: `64` - `256` -#### Learning Rate +#### (Optional) Learning Rate `learning_rate` is the learning rate used to update the discriminator. This should typically be decreased if training is unstable, and the GAIL loss is unstable. @@ -185,7 +194,7 @@ Default Value: `3e-4` Typical Range: `1e-5` - `1e-3` -#### Use Actions +#### (Optional) Use Actions `use_actions` determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to `True` if you want the agent to @@ -196,6 +205,16 @@ but may learn slower. Default Value: `false` +#### (Optional) Variational Discriminator Bottleneck + +`use_vail` enables a [variational bottleneck](https://arxiv.org/abs/1810.00821) within the +GAIL discriminator. This forces the discriminator to learn a more general representation +and reduces its tendency to be "too good" at discriminating, making learning more stable. +However, it does increase training time. Enable this if you notice your imitation learning is +unstable, or unable to learn the task at hand. + +Default Value: `false` + #### (Optional) Samples Per Update `samples_per_update` is the maximum number of samples to use during each discriminator update. You may @@ -206,12 +225,12 @@ Default Value: `0` Typical Range: Approximately equal to [`buffer_size`](Training-PPO.md) -#### (Optional) Variational Discriminator Bottleneck +#### (Optional) Num Epochs -`use_vail` enables a [variational bottleneck](https://arxiv.org/abs/1810.00821) within the -GAIL discriminator. This forces the discriminator to learn a more general representation -and reduces its tendency to be "too good" at discriminating, making learning more stable. -However, it does increase training time. Enable this if you notice your imitation learning is -unstable, or unable to learn the task at hand. +`num_epoch` The number of passes to make through the experience buffer when performing gradient +descent optimization for the discriminator. To avoid overfitting, this typically should be set to +the same as or less than used for PPO. -Default Value: `false` +Default Value: `3` + +Typical Range: `1` - `10` \ No newline at end of file diff --git a/ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py b/ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py index 891f12a69a..a7c923ac4d 100644 --- a/ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py +++ b/ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py @@ -34,6 +34,7 @@ def __init__( reward multiplied by the strength parameter :param gamma: The time discounting factor used for this reward. :param demo_path: The path to the demonstration file + :param num_epoch: The number of epochs to train over the training buffer for the discriminator. 
        :param encoding_size: The size of the hidden layers of the discriminator
        :param learning_rate: The Learning Rate used during GAIL updates.
        :param samples_per_update: The maximum number of samples to update during GAIL updates.
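
For context on how the new `num_epoch` parameter surfaces in configuration, a `gail` reward signal entry with every optional setting written out might look like the sketch below; the values are simply the defaults and typical values documented in Reward-Signals.md, and the demonstration path is the one used in the documentation examples.

```yaml
gail:
  strength: 0.01
  gamma: 0.9
  demo_path: demos/ExpertPyramid.demo
  encoding_size: 64        # hidden layer size of the discriminator
  learning_rate: 3e-4      # discriminator learning rate
  use_actions: false       # discriminate on observations only
  use_vail: false          # variational discriminator bottleneck disabled
  samples_per_update: 0    # 0 = min(buffer size, demonstration samples)
  num_epoch: 3             # discriminator passes per update, added in this change
```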