Add Soft Actor-Critic as trainer option (#2341)
* Add Soft Actor-Critic model, trainer, and policy and sac_trainer_config.yaml
* Add documentation for SAC and tweak PPO documentation to reference the new pages.
* Add tests for SAC, change simple_rl test to run both PPO and SAC.
Ervin T committed Sep 6, 2019
1 parent d150c51 commit 5cd2118
Showing 17 changed files with 3,064 additions and 133 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -25,6 +25,7 @@ developer communities.

* Unity environment control from Python
* 10+ sample Unity environments
* Two deep reinforcement learning algorithms, [Proximal Policy Optimization](docs/Training-PPO.md) (PPO) and [Soft Actor-Critic](docs/Training-SAC.md) (SAC)
* Support for multiple environment configurations and training scenarios
* Train memory-enhanced agents using deep reinforcement learning
* Easily definable Curriculum Learning and Generalization scenarios
276 changes: 276 additions & 0 deletions config/sac_trainer_config.yaml
@@ -0,0 +1,276 @@
default:
trainer: sac
batch_size: 128
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 128
init_entcoef: 1.0
learning_rate: 3.0e-4
max_steps: 5.0e4
memory_size: 256
normalize: false
num_update: 1
train_interval: 1
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 1000
tau: 0.005
use_recurrent: false
vis_encode_type: default
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

BananaLearning:
normalize: false
batch_size: 256
buffer_size: 500000
max_steps: 1.0e5
init_entcoef: 0.05
train_interval: 1

VisualBananaLearning:
beta: 1.0e-2
gamma: 0.99
num_epoch: 1
max_steps: 5.0e5
summary_freq: 1000

BouncerLearning:
normalize: true
beta: 0.0
max_steps: 5.0e5
num_layers: 2
hidden_units: 64
summary_freq: 1000

PushBlockLearning:
max_steps: 5.0e4
init_entcoef: 0.05
beta: 1.0e-2
hidden_units: 256
summary_freq: 2000
time_horizon: 64
num_layers: 2

SmallWallJumpLearning:
max_steps: 1.0e6
hidden_units: 256
summary_freq: 2000
time_horizon: 128
init_entcoef: 0.1
num_layers: 2
normalize: false

BigWallJumpLearning:
max_steps: 1.0e6
hidden_units: 256
summary_freq: 2000
time_horizon: 128
num_layers: 2
init_entcoef: 0.1
normalize: false

StrikerLearning:
max_steps: 5.0e5
learning_rate: 1e-3
beta: 1.0e-2
hidden_units: 256
summary_freq: 2000
time_horizon: 128
init_entcoef: 0.1
num_layers: 2
normalize: false

GoalieLearning:
max_steps: 5.0e5
learning_rate: 1e-3
beta: 1.0e-2
hidden_units: 256
summary_freq: 2000
time_horizon: 128
init_entcoef: 0.1
num_layers: 2
normalize: false

PyramidsLearning:
summary_freq: 2000
time_horizon: 128
batch_size: 128
buffer_init_steps: 10000
buffer_size: 500000
hidden_units: 256
num_layers: 2
init_entcoef: 0.01
max_steps: 5.0e5
sequence_length: 16
tau: 0.01
use_recurrent: false
reward_signals:
extrinsic:
strength: 2.0
gamma: 0.99
gail:
strength: 0.02
gamma: 0.99
encoding_size: 128
use_actions: true
demo_path: demos/ExpertPyramid.demo

VisualPyramidsLearning:
time_horizon: 128
batch_size: 64
hidden_units: 256
buffer_init_steps: 1000
num_layers: 1
beta: 1.0e-2
max_steps: 5.0e5
buffer_size: 500000
init_entcoef: 0.01
tau: 0.01
reward_signals:
extrinsic:
strength: 2.0
gamma: 0.99
gail:
strength: 0.02
gamma: 0.99
encoding_size: 128
use_actions: true
demo_path: demos/ExpertPyramid.demo

3DBallLearning:
normalize: true
batch_size: 64
buffer_size: 12000
summary_freq: 1000
time_horizon: 1000
hidden_units: 64
init_entcoef: 0.5
max_steps: 5.0e5

3DBallHardLearning:
normalize: true
batch_size: 256
summary_freq: 1000
time_horizon: 1000
max_steps: 5.0e5

TennisLearning:
normalize: true
max_steps: 2e5

CrawlerStaticLearning:
normalize: true
time_horizon: 1000
batch_size: 256
train_interval: 3
buffer_size: 500000
buffer_init_steps: 2000
max_steps: 5e5
summary_freq: 3000
init_entcoef: 1.0
num_layers: 3
hidden_units: 512

CrawlerDynamicLearning:
normalize: true
time_horizon: 1000
batch_size: 256
buffer_size: 500000
summary_freq: 3000
train_interval: 3
num_layers: 3
max_steps: 5e5
hidden_units: 512

WalkerLearning:
normalize: true
time_horizon: 1000
batch_size: 256
buffer_size: 500000
max_steps: 2e6
summary_freq: 3000
num_layers: 3
train_interval: 3
hidden_units: 512
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

ReacherLearning:
normalize: true
time_horizon: 1000
batch_size: 128
buffer_size: 500000
max_steps: 1e6
summary_freq: 3000

HallwayLearning:
use_recurrent: true
sequence_length: 32
num_layers: 2
hidden_units: 128
memory_size: 256
beta: 0.0
init_entcoef: 0.1
max_steps: 5.0e5
summary_freq: 1000
time_horizon: 64

VisualHallwayLearning:
use_recurrent: true
sequence_length: 32
num_layers: 1
hidden_units: 128
memory_size: 256
beta: 1.0e-2
gamma: 0.99
batch_size: 64
max_steps: 5.0e5
summary_freq: 1000
time_horizon: 64

VisualPushBlockLearning:
use_recurrent: true
sequence_length: 32
num_layers: 1
hidden_units: 128
memory_size: 256
beta: 1.0e-2
gamma: 0.99
buffer_size: 1024
batch_size: 64
max_steps: 5.0e5
summary_freq: 1000
time_horizon: 64

GridWorldLearning:
batch_size: 128
normalize: false
num_layers: 1
hidden_units: 128
init_entcoef: 0.01
buffer_size: 50000
max_steps: 5.0e5
summary_freq: 2000
time_horizon: 5
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.9

BasicLearning:
batch_size: 64
normalize: false
num_layers: 2
init_entcoef: 0.01
hidden_units: 20
max_steps: 5.0e5
summary_freq: 2000
time_horizon: 10
49 changes: 29 additions & 20 deletions docs/Getting-Started-with-Balance-Ball.md
@@ -52,16 +52,16 @@ to speed up training since all twelve agents contribute to training in parallel.

The Academy object for the scene is placed on the Ball3DAcademy GameObject. When
you look at an Academy component in the inspector, you can see several
properties that control how the environment works.
The **Broadcast Hub** keeps track of which Brains will send data during training.
If a Brain is added to the hub, the data from this Brain will be sent to the external training
process. If the `Control` checkbox is checked, the training process will be able to
control and train the agents linked to the Brain.
The **Training Configuration** and **Inference Configuration** properties
set the graphics and timescale properties for the Unity application.
The Academy uses the **Training Configuration** during training and the
**Inference Configuration** when not training. (*Inference* means that the
Agent is using a trained model or heuristics or direct control — in other
words, whenever **not** training.)
Typically, you would set a low graphics quality and a timescale greater than `1.0` for the **Training
Configuration**, and a high graphics quality and a timescale of `1.0` for the
@@ -94,8 +94,8 @@ returns the chosen action to the Agent. All Agents can share the same
Brain, but would act independently. The Brain settings tell you quite a bit about how
an Agent works.

You can create new Brain assets by selecting `Assets ->
Create -> ML-Agents -> Brain`. There are 3 types of Brains.
The **Learning Brain** is a Brain that uses a trained neural network to make decisions.
When the `Control` box is checked in the Brains property under the **Broadcast Hub** in the Academy, the external process that is training the neural network will take over decision making for the agents
and ultimately generate a trained neural network. You can also use the
@@ -184,18 +184,27 @@ The Ball3DAgent subclass defines the following methods:

Now that we have an environment, we can perform the training.

### Training with Deep Reinforcement Learning

In order to train an agent to correctly balance the ball, we provide two
deep reinforcement learning algorithms.

The default algorithm is Proximal Policy Optimization (PPO). This
is a method that has been shown to be more general purpose and stable
than many other RL algorithms. For more information on PPO, OpenAI
has a [blog post](https://blog.openai.com/openai-baselines-ppo/)
explaining it; see [our page](Training-PPO.md) for how to use it in training.

We also provide Soft Actor-Critic (SAC), an off-policy algorithm that
has been shown to be both stable and sample-efficient.
For more information on SAC, see UC Berkeley's
[blog post](https://bair.berkeley.edu/blog/2018/12/14/sac/), and see
[our page](Training-SAC.md) for guidance on when to use SAC vs. PPO. To
use SAC to train Balance Ball, replace all references to `config/trainer_config.yaml`
with `config/sac_trainer_config.yaml` below.

To train the agents within the Balance Ball environment, we will be using the
ML-Agents Python package. We have provided a convenient command called `mlagents-learn`
which accepts arguments used to configure both training and inference phases.
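
For example, assuming the same `mlagents-learn` invocation shown later in this
guide, a SAC run would look something like the sketch below (the run ID here is
just an arbitrary example):

```sh
# Train the Balance Ball agents with the SAC configuration instead of the PPO one.
mlagents-learn config/sac_trainer_config.yaml --run-id=firstRun-sac --train
```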

We can use `run_id` to identify the experiment and create a folder where the
@@ -271,9 +280,9 @@ From TensorBoard, you will see the summary statistics:
Once the training process completes and saves the model (denoted by the
`Saved Model` message), you can add it to the Unity project and use it with
Agents having a **Learning Brain**.
__Note:__ Do not just close the Unity Window once the `Saved Model` message appears.
Either wait for the training process to close the window or press Ctrl+C at the
command-line prompt. If you close the window manually, the `.nn` file
containing the trained model is not exported into the ml-agents folder.

### Embedding the trained model into Unity
1 change: 1 addition & 0 deletions docs/Readme.md
@@ -36,6 +36,7 @@

* [Training ML-Agents](Training-ML-Agents.md)
* [Training with Proximal Policy Optimization](Training-PPO.md)
* [Training with Soft Actor-Critic](Training-SAC.md)
* [Training with Curriculum Learning](Training-Curriculum-Learning.md)
* [Training with Imitation Learning](Training-Imitation-Learning.md)
* [Training with LSTM](Feature-Memory.md)
