Add Soft Actor-Critic as trainer option (#2341)
* Add Soft Actor-Critic model, trainer, and policy and sac_trainer_config.yaml
* Add documentation for SAC and tweak PPO documentation to reference the new pages.
* Add tests for SAC, change simple_rl test to run both PPO and SAC.
Ervin T committed Sep 6, 2019
1 parent d150c51 commit 5cd2118
Showing 17 changed files with 3,064 additions and 133 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -25,6 +25,7 @@ developer communities.

* Unity environment control from Python
* 10+ sample Unity environments
* Two deep reinforcement learning algorithms, [Proximal Policy Optimization](docs/Training-PPO.md) (PPO) and [Soft Actor-Critic](docs/Training-SAC.md) (SAC)
* Support for multiple environment configurations and training scenarios
* Train memory-enhanced agents using deep reinforcement learning
* Easily definable Curriculum Learning and Generalization scenarios
276 changes: 276 additions & 0 deletions config/sac_trainer_config.yaml
@@ -0,0 +1,276 @@
default:
trainer: sac
batch_size: 128
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 128
init_entcoef: 1.0
learning_rate: 3.0e-4
max_steps: 5.0e4
memory_size: 256
normalize: false
num_update: 1
train_interval: 1
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 1000
tau: 0.005
use_recurrent: false
vis_encode_type: default
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

BananaLearning:
normalize: false
batch_size: 256
buffer_size: 500000
max_steps: 1.0e5
init_entcoef: 0.05
train_interval: 1

VisualBananaLearning:
beta: 1.0e-2
gamma: 0.99
num_epoch: 1
max_steps: 5.0e5
summary_freq: 1000

BouncerLearning:
normalize: true
beta: 0.0
max_steps: 5.0e5
num_layers: 2
hidden_units: 64
summary_freq: 1000

PushBlockLearning:
max_steps: 5.0e4
init_entcoef: 0.05
beta: 1.0e-2
hidden_units: 256
summary_freq: 2000
time_horizon: 64
num_layers: 2

SmallWallJumpLearning:
max_steps: 1.0e6
hidden_units: 256
summary_freq: 2000
time_horizon: 128
init_entcoef: 0.1
num_layers: 2
normalize: false

BigWallJumpLearning:
max_steps: 1.0e6
hidden_units: 256
summary_freq: 2000
time_horizon: 128
num_layers: 2
init_entcoef: 0.1
normalize: false

StrikerLearning:
max_steps: 5.0e5
learning_rate: 1e-3
beta: 1.0e-2
hidden_units: 256
summary_freq: 2000
time_horizon: 128
init_entcoef: 0.1
num_layers: 2
normalize: false

GoalieLearning:
max_steps: 5.0e5
learning_rate: 1e-3
beta: 1.0e-2
hidden_units: 256
summary_freq: 2000
time_horizon: 128
init_entcoef: 0.1
num_layers: 2
normalize: false

PyramidsLearning:
summary_freq: 2000
time_horizon: 128
batch_size: 128
buffer_init_steps: 10000
buffer_size: 500000
hidden_units: 256
num_layers: 2
init_entcoef: 0.01
max_steps: 5.0e5
sequence_length: 16
tau: 0.01
use_recurrent: false
reward_signals:
extrinsic:
strength: 2.0
gamma: 0.99
gail:
strength: 0.02
gamma: 0.99
encoding_size: 128
use_actions: true
demo_path: demos/ExpertPyramid.demo

VisualPyramidsLearning:
time_horizon: 128
batch_size: 64
hidden_units: 256
buffer_init_steps: 1000
num_layers: 1
beta: 1.0e-2
max_steps: 5.0e5
buffer_size: 500000
init_entcoef: 0.01
tau: 0.01
reward_signals:
extrinsic:
strength: 2.0
gamma: 0.99
gail:
strength: 0.02
gamma: 0.99
encoding_size: 128
use_actions: true
demo_path: demos/ExpertPyramid.demo

3DBallLearning:
normalize: true
batch_size: 64
buffer_size: 12000
summary_freq: 1000
time_horizon: 1000
hidden_units: 64
init_entcoef: 0.5
max_steps: 5.0e5

3DBallHardLearning:
normalize: true
batch_size: 256
summary_freq: 1000
time_horizon: 1000
max_steps: 5.0e5

TennisLearning:
normalize: true
max_steps: 2e5

CrawlerStaticLearning:
normalize: true
time_horizon: 1000
batch_size: 256
train_interval: 3
buffer_size: 500000
buffer_init_steps: 2000
max_steps: 5e5
summary_freq: 3000
init_entcoef: 1.0
num_layers: 3
hidden_units: 512

CrawlerDynamicLearning:
normalize: true
time_horizon: 1000
batch_size: 256
buffer_size: 500000
summary_freq: 3000
train_interval: 3
num_layers: 3
max_steps: 5e5
hidden_units: 512

WalkerLearning:
normalize: true
time_horizon: 1000
batch_size: 256
buffer_size: 500000
max_steps: 2e6
summary_freq: 3000
num_layers: 3
train_interval: 3
hidden_units: 512
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

ReacherLearning:
normalize: true
time_horizon: 1000
batch_size: 128
buffer_size: 500000
max_steps: 1e6
summary_freq: 3000

HallwayLearning:
use_recurrent: true
sequence_length: 32
num_layers: 2
hidden_units: 128
memory_size: 256
beta: 0.0
init_entcoef: 0.1
max_steps: 5.0e5
summary_freq: 1000
time_horizon: 64

VisualHallwayLearning:
use_recurrent: true
sequence_length: 32
num_layers: 1
hidden_units: 128
memory_size: 256
beta: 1.0e-2
gamma: 0.99
batch_size: 64
max_steps: 5.0e5
summary_freq: 1000
time_horizon: 64

VisualPushBlockLearning:
use_recurrent: true
sequence_length: 32
num_layers: 1
hidden_units: 128
memory_size: 256
beta: 1.0e-2
gamma: 0.99
buffer_size: 1024
batch_size: 64
max_steps: 5.0e5
summary_freq: 1000
time_horizon: 64

GridWorldLearning:
batch_size: 128
normalize: false
num_layers: 1
hidden_units: 128
init_entcoef: 0.01
buffer_size: 50000
max_steps: 5.0e5
summary_freq: 2000
time_horizon: 5
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.9

BasicLearning:
batch_size: 64
normalize: false
num_layers: 2
init_entcoef: 0.01
hidden_units: 20
max_steps: 5.0e5
summary_freq: 2000
time_horizon: 10
49 changes: 29 additions & 20 deletions docs/Getting-Started-with-Balance-Ball.md
@@ -52,16 +52,16 @@ to speed up training since all twelve agents contribute to training in parallel.

The Academy object for the scene is placed on the Ball3DAcademy GameObject. When
you look at an Academy component in the inspector, you can see several
properties that control how the environment works.
The **Broadcast Hub** keeps track of which Brains will send data during training.
If a Brain is added to the hub, the data from this Brain will be sent to the external training
process. If the `Control` checkbox is checked, the training process will be able to
control and train the agents linked to the Brain.
The **Training Configuration** and **Inference Configuration** properties
set the graphics and timescale properties for the Unity application.
The Academy uses the **Training Configuration** during training and the
**Inference Configuration** when not training. (*Inference* means that the
Agent is using a trained model or heuristics or direct control — in other
words, whenever **not** training.)
Typically, you would set a low graphics quality and a timescale greater than `1.0` for the **Training
Configuration**, and a high graphics quality and a timescale of `1.0` for the
@@ -94,8 +94,8 @@ returns the chosen action to the Agent. All Agents can share the same
Brain, but would act independently. The Brain settings tell you quite a bit about how
an Agent works.

You can create new Brain assets by selecting `Assets ->
Create -> ML-Agents -> Brain`. There are 3 types of Brains.
The **Learning Brain** is a Brain that uses a trained neural network to make decisions.
When the `Control` box is checked in the Brains property under the **Broadcast Hub** in the Academy, the external process that is training the neural network will take over decision making for the agents
and ultimately generate a trained neural network. You can also use the
@@ -184,18 +184,27 @@ The Ball3DAgent subclass defines the following methods:

Now that we have an environment, we can perform the training.

### Training with Deep Reinforcement Learning

In order to train an agent to correctly balance the ball, we provide two
deep reinforcement learning algorithms.

The default algorithm is Proximal Policy Optimization (PPO). This
is a method that has been shown to be more general purpose and stable
than many other RL algorithms. For more information on PPO, OpenAI
has a [blog post](https://blog.openai.com/openai-baselines-ppo/)
explaining it; see [our page](Training-PPO.md) for how to use it in training.

We also provide Soft Actor-Critic (SAC), an off-policy algorithm that
has been shown to be both stable and sample-efficient.
For more information on SAC, see UC Berkeley's
[blog post](https://bair.berkeley.edu/blog/2018/12/14/sac/), and see
[our page](Training-SAC.md) for guidance on when to use SAC vs. PPO. To
use SAC to train Balance Ball, replace all references to `config/trainer_config.yaml`
with `config/sac_trainer_config.yaml` below.

To train the agents within the Balance Ball environment, we will be using the
ML-Agents Python package. We have provided a convenient command called `mlagents-learn`
which accepts arguments used to configure both training and inference phases.
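
For example, assuming the same `mlagents-learn` invocation shown later in this
guide, a SAC run would look something like the sketch below (the run ID here is
just an arbitrary example):

```sh
# Train the Balance Ball agents with the SAC configuration instead of the PPO one.
mlagents-learn config/sac_trainer_config.yaml --run-id=firstRun-sac --train
```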

We can use `run_id` to identify the experiment and create a folder where the
@@ -271,9 +280,9 @@ From TensorBoard, you will see the summary statistics:
Once the training process completes and saves the model (denoted by the
`Saved Model` message), you can add it to the Unity project and use it with
Agents having a **Learning Brain**.
__Note:__ Do not just close the Unity Window once the `Saved Model` message appears.
Either wait for the training process to close the window or press Ctrl+C at the
command-line prompt. If you close the window manually, the `.nn` file
containing the trained model is not exported into the ml-agents folder.

### Embedding the trained model into Unity
1 change: 1 addition & 0 deletions docs/Readme.md
@@ -36,6 +36,7 @@

* [Training ML-Agents](Training-ML-Agents.md)
* [Training with Proximal Policy Optimization](Training-PPO.md)
* [Training with Soft Actor-Critic](Training-SAC.md)
* [Training with Curriculum Learning](Training-Curriculum-Learning.md)
* [Training with Imitation Learning](Training-Imitation-Learning.md)
* [Training with LSTM](Feature-Memory.md)
