GAIL and Pretraining #2118

Merged on Jul 16, 2019 (146 commits). Changes shown are from 139 commits.

Commits:
eb4abf2
New version of GAIL
awjuliani Oct 9, 2018
d0852ac
Move Curiosity to separate class
awjuliani Oct 12, 2018
4b15b80
Curiosity fully working under new system
awjuliani Oct 12, 2018
ad9381b
Begin implementing GAIL
awjuliani Oct 12, 2018
8bf8302
fix discrete curiosity
vincentpierre Oct 12, 2018
d3e244e
Add expert demonstration
awjuliani Oct 13, 2018
a5b95f7
Remove notebook
awjuliani Oct 13, 2018
dc2fcaa
Record intrinsic rewards properly
awjuliani Oct 13, 2018
49cff40
Add gail model updating
awjuliani Oct 13, 2018
48d3769
Code cleanup
awjuliani Oct 15, 2018
6eeb565
Nested structure for intrinsic rewards
awjuliani Oct 15, 2018
8ca7728
Rename files
awjuliani Oct 15, 2018
226b5c7
Update models so files
awjuliani Oct 15, 2018
3386aa7
fix typo
awjuliani Oct 15, 2018
6799756
Add reward strength parameter
awjuliani Oct 15, 2018
468c407
Use dictionary of reward signals
awjuliani Oct 17, 2018
519e2d3
Remove reward manager
awjuliani Oct 17, 2018
7df1a69
Extrinsic reward just another type
awjuliani Oct 17, 2018
99237cd
Clean up imports
awjuliani Oct 17, 2018
9fa51c1
All reward signals use strength to scale output
awjuliani Oct 17, 2018
7f24677
produce scaled and unscaled reward
awjuliani Oct 18, 2018
4a714d0
Remove unused dictionary
awjuliani Oct 18, 2018
3e2671d
Current trainer config
awjuliani Oct 18, 2018
77211d8
Add discrete control and pyramid experimentation
awjuliani Oct 19, 2018
2334de8
Minor changes to GAIL
awjuliani Oct 20, 2018
439387e
Add relevant strength parameters
awjuliani Oct 21, 2018
ba793a3
Replace string
awjuliani Oct 21, 2018
a52ba0b
Add support for visual observations w/ GAIL
awjuliani Oct 31, 2018
5b2ef22
Finish implementing visual obs for GAIL
awjuliani Nov 1, 2018
13542b4
Include demo files
awjuliani Nov 1, 2018
ae7a8b0
Fix for RNN w/ GAIL
awjuliani Nov 1, 2018
bf89082
Keep track of reward streams separately
awjuliani Nov 2, 2018
360482b
Bootstrap value estimates separately
awjuliani Nov 2, 2018
c78639d
Add value head
awjuliani Nov 14, 2018
3b2485d
Use sepaprate value streams for each reward
awjuliani Nov 15, 2018
40bc9ba
Add VAIL
awjuliani Nov 15, 2018
c6e1504
Use adaptive B
awjuliani Nov 16, 2018
60d9ff7
Comments improvements
vincentpierre Jan 10, 2019
49ec682
Added comments and refactored a pievce of the code
vincentpierre Jan 10, 2019
d9847e0
Added Comments
vincentpierre Jan 10, 2019
dc7620b
Fix on Curriosity
vincentpierre Jan 11, 2019
28e0bd5
Fixed typo
vincentpierre Jan 11, 2019
0257d2b
Added a forgotten comment
vincentpierre Jan 11, 2019
fd55c00
Stabilized Vail learning. Still no learning for Walker
vincentpierre Jan 14, 2019
2343b3f
Fixing typo on curiosity when using visual input
vincentpierre Jan 17, 2019
c74ad19
Added some comments
vincentpierre Jan 17, 2019
2dd7c61
modified the hyperparameters
vincentpierre Jan 17, 2019
42429a5
Fixed some of the tests, will need to refactor the reward signals in …
vincentpierre Jan 19, 2019
ec0e106
Putting the has_updated fags inside each reward signal
vincentpierre Jan 22, 2019
6ae1c2f
Added comments for the GAIL update method
vincentpierre Jan 22, 2019
ef65bc2
initial commit
vincentpierre Jan 24, 2019
8cbdbf4
No more normalization after pre-training
vincentpierre Jan 24, 2019
3f35d45
Fixed large bug in Vail
vincentpierre Jan 30, 2019
3be9be7
BUG FIX VAIL : The noise dimension was wrong and the discriminator sc…
vincentpierre Feb 1, 2019
9e9b4ff
implemented discrete control pretraining
vincentpierre Feb 2, 2019
d537a6b
bug fixing
vincentpierre Feb 3, 2019
713263c
Bug fix, still not tested for recurrent
vincentpierre Feb 6, 2019
ca5b948
Fixing beta in GAIL so it will change properly
vincentpierre Mar 6, 2019
671629e
Allow for not specifying an extrinsic reward
Apr 19, 2019
a31c8a5
Rough implementation of annealed BC
Apr 24, 2019
93cb4ff
Fixes for rebase onto v0.8
Apr 24, 2019
6534291
Moved BC trainer out of reward_signals and code cleanup
Apr 25, 2019
700b478
Rename folder to "components"
Apr 25, 2019
71eedf5
Fix renaming in Curiosity
Apr 25, 2019
83b4603
Remove demo_aided as a required param
May 2, 2019
9e4b4e2
Make old BC compatible
May 2, 2019
f814432
Fix visual obs for curiosity
May 3, 2019
e10194f
Tweaks all around
May 9, 2019
fdcfb30
Add reward normalization and bug fix
May 9, 2019
cb5e927
Load multiple .demo files. Fix bug with csv nans
May 30, 2019
2c5c853
Remove reward normalization
May 30, 2019
e66a343
Rename demo_aided to pretraining
May 30, 2019
0a98289
Fix bc configs
May 30, 2019
cd6e498
Increase small val to prevent NaNs
May 30, 2019
d23f6f3
Fix init in components
May 31, 2019
d93e36e
Merge remote-tracking branch 'origin/develop' into develop-irl-ervin
May 31, 2019
1bf68c7
Fix PPO tests
May 31, 2019
9da6e6c
Refactor components into common location
May 31, 2019
4a57a32
Minor code cleanup
Jun 3, 2019
11cc6f9
Preliminary RNN support
Jun 5, 2019
e66a6f7
Revert regression with NaNs for LSTMs
Jun 6, 2019
bea2bc7
Better LSTM support for BC
Jun 6, 2019
6302a55
Code cleanup and black reformat
Jun 6, 2019
d1cded9
Remove demo_helper and reformat signal
Jun 6, 2019
2b98f3b
Tests for GAIL and curiosity
Jun 6, 2019
440146b
Fix Black again...
Jun 6, 2019
98f9160
Tests for BCModule and visual tests for RewardSignals
Jun 6, 2019
5c923cb
Refactor to new structure and use class generator
Jun 7, 2019
e7ce888
Generalize reward_signal interface and stats
Jun 8, 2019
858194f
Fix incorrect environment reward reporting
Jun 10, 2019
28bceba
Rename reward signals for consistency. clean up comments
Jun 10, 2019
248cae4
Default trainer config (for cloud testing)
Jun 10, 2019
744df94
Remove "curiosity_enc_size" from the regular params
Jun 10, 2019
31dabfc
Fix PushBlock config
Jun 10, 2019
a557f84
Revert Pyramids environment
Jun 10, 2019
d4dbddb
Fix indexing issue with add_experiences
Jun 11, 2019
ddb673b
Fix tests
Jun 11, 2019
975e05b
Change to BCModule
Jun 11, 2019
a83fd5d
Merge branch 'develop' into develop-irl-ervin
Jun 12, 2019
fae7646
Remove the bools for reward signals
Jun 12, 2019
5cf98ac
Make update take in a mini buffer rather than the
Jun 13, 2019
d1afc9b
Always reference reward signals name and not index
Jun 13, 2019
80f2c75
More code cleanup
Jun 13, 2019
394b25a
Clean up reward_signal abstract class
Jun 13, 2019
a9724a3
Fix issue with recording values
Jun 13, 2019
66fef61
Add use_actions to GAIL
Jun 17, 2019
0e3be1d
Add documentation for Reward Signals
Jun 17, 2019
015f50d
Add documentation for GAIL
Jun 17, 2019
7c3059b
Remove unused variables in BCModel
Jun 17, 2019
16c3c06
Remove Entropy Reward Signal
Jun 17, 2019
1fbfa5d
Change tests to use safe_load
Jun 17, 2019
f9a3808
Don't use mutable default
Jun 17, 2019
ce551bf
Set defaults in parent __init__ (Reward Signals)
Jun 17, 2019
3e7ea5b
Remove unneccesary lines
Jun 17, 2019
eda6993
Merge branch 'develop' into develop-irl-ervin
Jul 3, 2019
cace2e6
Make some files same as develop
Jul 3, 2019
3f161fc
Add demos for example envs
Jul 4, 2019
2794c75
Update docs
Jul 4, 2019
48b7b43
Fix tests, imports, cleanup code
Jul 8, 2019
f47b173
Make pretrainer stats similar to reward signal
Jul 9, 2019
1e257d4
Merge branch 'develop' of github.com:Unity-Technologies/ml-agents int…
Jul 9, 2019
a8b5d09
Fixes after merge develop
Jul 10, 2019
fb3d5ae
Additional tests, bugfix for LSTM+BC+Visual
Jul 10, 2019
7e0a677
GAIL code cleanup
Jul 10, 2019
1953233
Add types to BCModel
Jul 10, 2019
593f819
Fix bugs with incorrect return values
Jul 11, 2019
98b7732
Change tests to use RewardSignalResult
Jul 11, 2019
6ee0c63
Add docs for pretraining and plot for all three
Jul 11, 2019
6d37be2
Fix bug with demo loading directories, add test
Jul 11, 2019
c672ad9
Add typing to BCModule, GAIL, and demo loader
Jul 11, 2019
61e84c6
Fix black
Jul 11, 2019
9d43336
Fix mypy issues
Jul 11, 2019
99a2a3c
Codacy cleanup
Jul 12, 2019
cbb1af3
Doc fixes
Jul 12, 2019
736c807
More sophisticated tests for reward signals
Jul 13, 2019
04e22fd
Fix bug in GAIL when num_sequences is 1
Jul 13, 2019
8ead02e
Clean up use_vail and feed_dicts
Jul 15, 2019
71f85e1
Change to swish from learningmodel
Jul 15, 2019
5537e60
Make variables more readable
Jul 15, 2019
73d20cb
Code and comment cleanup
Jul 15, 2019
f4950b4
Not all should be swish
Jul 15, 2019
6784ee6
Remove prints
Jul 15, 2019
2704e62
Doc updates
Jul 15, 2019
1206a89
Make VAIL default false, improve logging
Jul 15, 2019
2407a5a
Fix tests for sequences
Jul 16, 2019
4aa033b
Change max_batches and set VAIL to default to false
Jul 16, 2019
Binary files added:

* demos/Expert3DBall.demo
* demos/Expert3DBallHard.demo
* demos/ExpertBanana.demo
* demos/ExpertBasic.demo
* demos/ExpertBouncer.demo
* demos/ExpertCrawlerDyn.demo
* demos/ExpertCrawlerSta.demo
* demos/ExpertGrid.demo
* demos/ExpertHallway.demo
* demos/ExpertPush.demo
* demos/ExpertPyramid.demo
* demos/ExpertReacher.demo
* demos/ExpertSoccerGoal.demo
* demos/ExpertSoccerStri.demo
* demos/ExpertTennis.demo
* demos/ExpertWalker.demo
92 changes: 92 additions & 0 deletions docs/Training-BehavioralCloning.md
@@ -0,0 +1,92 @@
# Training with Behavioral Cloning

There are a variety of possible imitation learning algorithms that can be
used; the simplest of them is Behavioral Cloning. It works by collecting
demonstrations from a teacher and then using them to directly learn a
policy, in the same way supervised learning for image classification
or other traditional Machine Learning tasks works.

## Offline Training

With offline behavioral cloning, we can use demonstrations (`.demo` files)
generated using the `Demonstration Recorder` as the dataset used to train a behavior.

1. Choose an agent you would like to train to imitate a set of demonstrations.
2. Record a set of demonstrations using the `Demonstration Recorder` (see [here](Training-Imitation-Learning.md)).
For illustrative purposes we will refer to this file as `AgentRecording.demo`.
3. Build the scene, assigning the agent a Learning Brain, and set the Brain to
Control in the Broadcast Hub. For more information on Brains, see
[here](Learning-Environment-Design-Brains.md).
4. Open the `config/offline_bc_config.yaml` file.
5. Modify the `demo_path` parameter in the file to reference the path to the
demonstration file recorded in step 2. In our case this is:
`./UnitySDK/Assets/Demonstrations/AgentRecording.demo`
6. Launch `mlagents-learn`, providing `./config/offline_bc_config.yaml`
as the config parameter, and include the `--run-id` and `--train` flags as usual.
Provide your environment as the `--env` parameter if it has been compiled
as a standalone build, or omit it to train in the Editor.
7. (Optional) Observe training performance using TensorBoard.

This will use the demonstration file to train a neural network driven agent
to directly imitate the actions provided in the demonstration. The environment
will launch and be used for evaluating the agent's performance during training.
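
As a rough illustration, the entry you edit in steps 4 and 5 might look like the
sketch below. The brain name `StudentBrain` and the hyperparameter values are
placeholders, and the exact keys in your copy of `config/offline_bc_config.yaml`
may differ; only `demo_path` needs to point at your recording.

```
# Hypothetical excerpt from config/offline_bc_config.yaml (illustrative only)
StudentBrain:
    trainer: offline_bc        # offline behavioral cloning trainer
    max_steps: 5.0e4           # total training steps
    batch_size: 64             # demonstration experiences per gradient update
    batches_per_epoch: 10      # batches trained per epoch
    demo_path: ./UnitySDK/Assets/Demonstrations/AgentRecording.demo
```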

## Online Training

It is also possible to provide demonstrations in realtime during training,
without pre-recording a demonstration file. The steps to do this are as follows:

1. First create two Brains, one which will be the "Teacher," and the other which
will be the "Student." We will assume that the names of the Brain
Assets are "Teacher" and "Student" respectively.
2. The "Teacher" Brain must be a **Player Brain**. You must properly
configure the inputs to map to the corresponding actions.
3. The "Student" Brain must be a **Learning Brain**.
4. The Brain Parameters of both the "Teacher" and "Student" Brains must be
compatible with the agent.
5. Drag both the "Teacher" and "Student" Brain into the Academy's `Broadcast Hub`
and check the `Control` checkbox on the "Student" Brain.
6. Link the Brains to the desired Agents (one Agent as the teacher and at least
one Agent as a student).
7. In `config/online_bc_config.yaml`, add an entry for the "Student" Brain. Set
the `trainer` parameter of this entry to `online_bc`, and the
`brain_to_imitate` parameter to the name of the teacher Brain: "Teacher".
Additionally, set `batches_per_epoch`, which controls how many batches are
trained each epoch. Increase the `max_steps` option if you'd like to keep training
the Agents for a longer period of time. (A sketch of such an entry is shown after this list.)
8. Launch the training process with `mlagents-learn config/online_bc_config.yaml
--train --slow`, and press the :arrow_forward: button in Unity when the
message _"Start training by pressing the Play button in the Unity Editor"_ is
displayed on the screen
9. From the Unity window, control the Agent with the Teacher Brain by providing
"teacher demonstrations" of the behavior you would like to see.
10. Watch as the Agent(s) with the student Brain attached begin to behave
similarly to the demonstrations.
11. Once the Student Agents are exhibiting the desired behavior, end the training
process with `CTRL+C` from the command line.
12. Move the resulting `*.nn` file into the `TFModels` subdirectory of the
Assets folder (or a subdirectory within Assets of your choosing), and use it
with a `Learning` Brain.
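
As referenced in step 7, a hypothetical entry for the "Student" Brain in
`config/online_bc_config.yaml` might look like the following sketch; the values
are placeholders and the surrounding defaults in your copy of the file may differ.

```
# Hypothetical entry for the "Student" Brain (illustrative only)
Student:
    trainer: online_bc         # imitate the teacher in realtime
    brain_to_imitate: Teacher  # name of the teacher (Player) Brain
    batches_per_epoch: 5       # batches trained each epoch
    max_steps: 1.0e5           # increase to keep training longer
```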

**BC Teacher Helper**

We provide a convenience utility, the `BC Teacher Helper` component, that you can add
to the Teacher Agent.

<p align="center">
<img src="images/bc_teacher_helper.png"
alt="BC Teacher Helper"
width="375" border="10" />
</p>

This utility enables you to use keyboard shortcuts to do the following:

1. Start and stop recording experiences. This is useful in case you'd like to
interact with the game _but not have the agents learn from these
interactions_. The default command to toggle recording is to press `R` on the
keyboard.

2. Reset the training buffer. This enables you to instruct the agents to forget
their buffer of recent experiences. This is useful if you'd like to get them
to quickly learn a new behavior. The default command to reset the buffer is
to press `C` on the keyboard.
125 changes: 30 additions & 95 deletions docs/Training-Imitation-Learning.md
@@ -10,6 +10,35 @@ from the game and actions from a game controller to guide the medic's behavior.
Imitation Learning uses pairs of observations and actions from
a demonstration to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFYs).

Imitation learning can also be used to help reinforcement learning. Especially in
environments with sparse (i.e., infrequent or rare) rewards, the agent may never see
the reward and thus not learn from it. Curiosity helps the agent explore, but in some cases
it is easier to just show the agent how to achieve the reward. In these cases,
imitation learning can dramatically reduce the time it takes to solve the environment.
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids),
just 6 episodes of demonstrations can reduce the number of training steps required by more than a factor of four.

<p align="center">
<img src="images/mlagents-ImitationAndRL.png"
alt="Using Demonstrations with Reinforcement Learning"
width="350" border="0" />
</p>

ML-Agents provides several ways to interact with demonstrations. For most situations,
[GAIL](Training-RewardSignals.md#the-gail-reward-signal) is the preferred approach.

* To train using GAIL (Generative Adversarial Imitation Learning), you can add the
[GAIL reward signal](Training-RewardSignals.md#the-gail-reward-signal) to your
trainer configuration (see the sketch after this list). GAIL can be
used with or without environment rewards, and works well when there are a limited
number of demonstrations.
* To help bootstrap reinforcement learning, you can enable
[pretraining](Training-PPO.md#optional-pretraining-using-demonstrations)
on the PPO trainer, in addition to using a small GAIL reward signal.
* To train an agent to exactly mimic demonstrations, you can use the
[Behavioral Cloning](Training-BehavioralCloning.md) trainer. Behavioral Cloning can be
used offline and online (in-editor), and learns very quickly. However, it is usually
ineffective in more complex environments without a large number of demonstrations.
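
For reference, a `reward_signals` block that adds GAIL alongside the extrinsic
reward might look like the sketch below. The numeric values are illustrative,
and the option names (in particular `encoding_size`) should be checked against
[Training-RewardSignals.md](Training-RewardSignals.md#the-gail-reward-signal),
which is the authoritative list.

```
# Illustrative sketch of a trainer_config reward_signals block with GAIL
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    gail:
        strength: 0.01     # keep small when combined with an extrinsic reward
        gamma: 0.99
        encoding_size: 128
        demo_path: demos/ExpertPyramid.demo
```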

## Recording Demonstrations

It is possible to record demonstrations of agent behavior from the Unity Editor,
Expand Down Expand Up @@ -43,98 +72,4 @@ inspector.
alt="BC Teacher Helper"
width="375" border="10" />
</p>


_(Removed: the former "Training with Behavioral Cloning" section of this file, which was moved, with minor link adjustments, into the new `docs/Training-BehavioralCloning.md` shown above.)_
70 changes: 68 additions & 2 deletions docs/Training-PPO.md
@@ -22,8 +22,7 @@ If you are using curriculum training to pace the difficulty of the learning task
presented to an agent, see [Training with Curriculum
Learning](Training-Curriculum-Learning.md).

For information about imitation learning from demonstrations, see
[Training with Imitation Learning](Training-Imitation-Learning.md).

## Best Practices when training with PPO
@@ -191,6 +190,73 @@ the agent will need to remember in order to successfully complete the task.

Typical Range: `64` - `512`

## (Optional) Pretraining Using Demonstrations

In some cases, we may want to bootstrap the agent's policy using behavior recorded
from a player. This can help guide the agent towards the reward. Pretraining adds
training operations that mimic a demonstration rather than attempting to maximize reward.
It is essentially equivalent to running [behavioral cloning](./Training-BehavioralCloning.md)
in-line with PPO.

To use pretraining, add a `pretraining` section to the trainer_config. For instance:

```
pretraining:
demo_path: ./demos/ExpertPyramid.demo
strength: 0.5
steps: 10000
```

Below are the available hyperparameters for pretraining.

### Strength

`strength` corresponds to the learning rate of the imitation relative to the learning
rate of PPO, and roughly corresponds to how strongly we allow the behavioral cloning
to influence the policy.

Typical Range: `0.1` - `0.5`

### Demo Path

`demo_path` is the path to your `.demo` file or directory of `.demo` files.
See the [imitation learning guide](Training-Imitation-Learning.md) for more on `.demo` files.

### Steps

During pretraining, it is often desirable to stop using demonstrations after the agent has
"seen" rewards, and allow it to optimize past the available demonstrations and/or generalize
outside of the provided demonstrations. `steps` corresponds to the training steps over which
pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set
the steps to 0 for constant imitation over the entire training run.

### (Optional) Batch Size

`batch_size` is the number of demonstration experiences used for one iteration of a gradient
descent update. If not specified, it will default to the `batch_size` defined for PPO.

Typical Range (Continuous): `512` - `5120`

Typical Range (Discrete): `32` - `512`

### (Optional) Number of Epochs

`num_epoch` is the number of passes through the experience buffer during
gradient descent. If not specified, it will default to the number of epochs set for PPO.

Typical Range: `3` - `10`

### (Optional) Max Batches
> **Review discussion**
>
> **Contributor:** This one might be a little confusing to people.
>
> **Author:** I agree, but I also think it's necessary for people who have a huge demonstration dataset. We could do a couple of things:
> * Remove the option and just set to the buffer size given by PPO - perhaps allow overriding
> * Change the option to Samples Per Update or Demonstration Buffer Size
> * Leave as-is
>
> **Contributor:** I agree that it is useful to have. I think it just needs a different name that is a little more descriptive. "Samples Per Update" could fit the bill.
>
> **Author:** Done! (and yes any issues were related to the stochasticity)
`max_batches` is the maximum number of batches of `batch_size`
to use during each imitation update. You may want to lower this if your demonstration
dataset is very large, to avoid overfitting the policy on demonstrations. Set to 0
to train over all of the demonstrations at each update step.

Default Value: `0` (all)

Typical Range: `10` - `20`
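
Putting the optional hyperparameters together, a `pretraining` section that
overrides all of them might look like the sketch below; the numeric values are
illustrative rather than recommended defaults.

```
# Sketch of a pretraining section with the optional hyperparameters set
pretraining:
    demo_path: ./demos/ExpertPyramid.demo
    strength: 0.5       # imitation learning rate relative to PPO
    steps: 10000        # anneal the pretrainer over this many steps
    batch_size: 512     # optional, defaults to the PPO batch_size
    num_epoch: 3        # optional, defaults to the PPO num_epoch
    max_batches: 10     # optional, 0 trains over all demonstrations each update
```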

## Training Statistics

To view training statistics, use TensorBoard. For information on launching and