diff --git a/docs/ML-Agents-Overview.md b/docs/ML-Agents-Overview.md
index 5d727ca6ad..04224be32c 100644
--- a/docs/ML-Agents-Overview.md
+++ b/docs/ML-Agents-Overview.md
@@ -185,8 +185,8 @@ range of training and inference scenarios:
 - **Learning** - where decisions are made using an embedded
   [TensorFlow](Background-TensorFlow.md) model. The embedded TensorFlow model
   represents a learned policy and the Brain directly uses this model to
-  determine the action for each Agent. You can train a **Learning Brain**
-  by dragging it into the Academy's `Broadcast Hub` with the `Control`
+  determine the action for each Agent. You can train a **Learning Brain**
+  by dragging it into the Academy's `Broadcast Hub` with the `Control`
   checkbox checked.
 - **Player** - where decisions are made using real input from a keyboard or
   controller. Here, a human player is controlling the Agent and the observations
@@ -224,7 +224,7 @@ inference can proceed.
 
 As mentioned previously, the ML-Agents toolkit ships with several
 implementations of state-of-the-art algorithms for training intelligent agents.
-In this mode, the only Brain used is a **Learning Brain**. More
+In this mode, the only Brain used is a **Learning Brain**. More
 specifically, during training, all the medics in the
 scene send their observations to the Python API through the External
 Communicator (this is the behavior with an External Brain). The Python API
@@ -244,7 +244,7 @@ time.
 To summarize: our built-in implementations are based on TensorFlow, thus,
 during training the Python API uses the observations it receives to learn a
 TensorFlow model. This model is then embedded within the Learning Brain during inference to
-generate the optimal actions for all Agents linked to that Brain.
+generate the optimal actions for all Agents linked to that Brain.
 
 The
 [Getting Started with the 3D Balance Ball Example](Getting-Started-with-Balance-Ball.md)
@@ -255,7 +255,7 @@ tutorial covers this training mode with the **3D Balance Ball** sample environme
 In the previous mode, the Learning Brain was used for training to generate a
 TensorFlow model that the Learning Brain can later use. However, any user of
 the ML-Agents toolkit can leverage their own algorithms for
-training. In this case, the Brain type would be set to Learning and be linked
+training. In this case, the Brain type would be set to Learning and be linked
 to the BroadcastHub (with checked `Control` checkbox) and the behaviors of all
 the Agents in the scene will be controlled within Python.
 You can even turn your environment into a [gym.](../gym-unity/README.md)
@@ -319,8 +319,10 @@ imitation learning algorithm will then use these pairs of observations and
 actions from the human player to learn a policy. [Video
 Link](https://youtu.be/kpb8ZkMBFYs).
 
-The [Training with Imitation Learning](Training-Imitation-Learning.md) tutorial
-covers this training mode with the **Banana Collector** sample environment.
+ML-Agents provides ways both to learn directly from demonstrations and to
+use demonstrations to help speed up reward-based training. The
+[Training with Imitation Learning](Training-Imitation-Learning.md) tutorial
+covers these features in more depth.
 
 ## Flexible Training Scenarios
 
@@ -408,7 +410,7 @@ training process.
 - **Broadcasting** - As discussed earlier, a Learning Brain sends the
   observations for all its Agents to the Python API when dragged into the
   Academy's `Broadcast Hub` with the `Control` checkbox checked. This is helpful
-  for training and later inference. 
Broadcasting is a feature which can be
+  for training and later inference. Broadcasting is a feature which can be
   enabled all types of Brains (Player, Learning, Heuristic) where the Agent
   observations and actions are also sent to the Python API (despite the fact
   that the Agent is **not** controlled by the Python API). This feature is
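
For the "Custom Training and Inference" scenario referenced in the hunks above, where a Learning Brain in the `Broadcast Hub` (with `Control` checked) hands control of every Agent to Python, the gym interface mentioned in the text is the quickest way to plug in your own algorithm. The following is a minimal sketch, assuming the `gym_unity` package is installed and an environment binary has been built to `./envs/3DBall`; the path and the exact `UnityEnv` constructor arguments are illustrative and may differ between releases.

```python
from gym_unity.envs import UnityEnv

# Wrap a built Unity environment as a standard gym environment.
# "./envs/3DBall" is a placeholder path to your own build.
env = UnityEnv("./envs/3DBall", worker_id=0, use_visual=False)

try:
    obs = env.reset()
    for _ in range(1000):
        # A random policy stands in for your own training algorithm.
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()
finally:
    env.close()
```

Because the wrapper exposes the usual `reset`/`step`/`close` calls, any algorithm written against the gym API can drive the scene without further changes.
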
diff --git a/docs/Training-ML-Agents.md b/docs/Training-ML-Agents.md
index da597e584b..97c5bcecc9 100644
--- a/docs/Training-ML-Agents.md
+++ b/docs/Training-ML-Agents.md
@@ -170,10 +170,7 @@ environments are included in the provided config file.
 | brain\_to\_imitate | For online imitation learning, the name of the GameObject containing the Brain component to imitate. | (online)BC |
 | demo_path | For offline imitation learning, the file path of the recorded demonstration file | (offline)BC |
 | buffer_size | The number of experiences to collect before updating the policy model. | PPO |
-| curiosity\_enc\_size | The size of the encoding to use in the forward and inverse models in the Curiosity module. | PPO |
-| curiosity_strength | Magnitude of intrinsic reward generated by Intrinsic Curiosity Module. | PPO |
 | epsilon | Influences how rapidly the policy can evolve during training. | PPO |
-| gamma | The reward discount rate for the Generalized Advantage Estimator (GAE). | PPO |
 | hidden_units | The number of units in the hidden layers of the neural network. | PPO, BC |
 | lambd | The regularization parameter. | PPO |
 | learning_rate | The initial learning rate for gradient descent. | PPO, BC |
@@ -182,13 +179,15 @@ environments are included in the provided config file.
 | normalize | Whether to automatically normalize observations. | PPO |
 | num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO |
 | num_layers | The number of hidden layers in the neural network. | PPO, BC |
+| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO |
+| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Training-RewardSignals.md) for configuration options. | PPO |
 | sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |
 | summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, BC |
 | time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, (online)BC |
-| trainer | The type of training to perform: "ppo" or "imitation". | PPO, BC |
-| use_curiosity | Train using an additional intrinsic reward signal generated from Intrinsic Curiosity Module. | PPO |
+| trainer | The type of training to perform: "ppo", "offline_bc" or "online_bc". | PPO, BC |
 | use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |
+
 \*PPO = Proximal Policy Optimization, BC = Behavioral Cloning (Imitation)
 
 For specific advice on setting hyperparameters based on the type of training you
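
To make the new `reward_signals` and `pretraining` rows concrete, here is a sketch of what a single Brain's entry in `trainer_config.yaml` could look like once `gamma` and the curiosity flags move under `reward_signals` and a recorded demonstration is wired in through `pretraining`. The Brain name, demonstration path, and numeric values below are placeholders; [Reward Signals](Training-RewardSignals.md) and [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations) remain the authoritative reference for the available keys and their defaults.

```yaml
MedicLearning:                    # placeholder Brain name
    trainer: ppo
    batch_size: 1024
    buffer_size: 10240
    max_steps: 5.0e5

    # Each reward signal now carries its own strength and discount rate,
    # replacing the old top-level gamma / use_curiosity / curiosity_strength
    # / curiosity_enc_size hyperparameters.
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
        curiosity:
            strength: 0.02
            gamma: 0.99
            encoding_size: 256

    # Optional: bootstrap the policy from a recorded demonstration file.
    pretraining:
        demo_path: ./demos/ExpertMedic.demo   # placeholder path
        strength: 0.5
        steps: 10000
```

The same block is where a GAIL reward signal would be enabled, alongside or instead of `curiosity`, as described in the Reward Signals page.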