
Off-policy training #111

Closed
Lucaweihs opened this issue Aug 5, 2020 · 4 comments · Fixed by #144
Labels: enhancement (New feature or request)

@Lucaweihs (Collaborator)

Problem

The ADVISOR codebase has support for interleaving off-policy updates (from an arbitrary PyTorch dataset and with arbitrary losses) with on-policy updates. It would be great to have similar capabilities here. In particular, we should be able to:

  1. Define a pipeline stage so that it performs some fixed number of off-policy updates.
  2. Allow off-policy updates to be interleaved with on-policy updates (a rough sketch of what such a stage might look like follows this list).
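To make the two requirements concrete, here is a minimal sketch (in PyTorch-style Python) of what a stage definition with an off-policy component could look like. All names here (`OffPolicyComponent`, `PipelineStage` and its `offpolicy_component` field, etc.) are illustrative assumptions, not the existing API.

```python
# Hypothetical sketch only -- these classes and fields do not exist yet.
from typing import List, Optional

from torch.utils.data import Dataset


class OffPolicyComponent:
    """Pairs an arbitrary PyTorch dataset with losses used to update from it."""

    def __init__(self, dataset: Dataset, loss_names: List[str], updates: int, batch_size: int):
        self.dataset = dataset        # e.g. expert demonstrations stored on disk
        self.loss_names = loss_names  # e.g. ["offpolicy_imitation"]
        self.updates = updates        # fixed number of off-policy updates (point 1)
        self.batch_size = batch_size


class PipelineStage:
    """A stage mixes on-policy losses with an optional off-policy component (point 2)."""

    def __init__(
        self,
        loss_names: List[str],
        max_stage_steps: int,
        offpolicy_component: Optional[OffPolicyComponent] = None,
    ):
        self.loss_names = loss_names
        self.max_stage_steps = max_stage_steps
        self.offpolicy_component = offpolicy_component
```

Under this reading, a purely off-policy stage would pass `loss_names=[]` and rely only on its `offpolicy_component`, while an interleaved stage would set both.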

Solution

This requires:

  • A way to specify how this training will occur (e.g. defining off-policy losses and updating the pipeline stage to allow for these types of losses + dataset); a possible loss interface is sketched after this list.
  • Updating the runner/light_engine to implement these types of updates.
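For the first bullet, the off-policy losses presumably cannot reuse the on-policy loss signature (which consumes rollout storage). Below is a hedged sketch of a batch-based loss interface, with all names chosen for illustration rather than taken from the code.

```python
# Illustrative interface: losses receive batches sampled from a Dataset
# rather than from on-policy rollout storage. Names are hypothetical.
import abc
from typing import Dict, Tuple

import torch
import torch.nn as nn


class AbstractOffPolicyLoss(abc.ABC):
    @abc.abstractmethod
    def loss(
        self, model: nn.Module, batch: Dict[str, torch.Tensor]
    ) -> Tuple[torch.Tensor, Dict[str, float]]:
        """Return the scalar loss and values to log for one off-policy batch."""
        raise NotImplementedError


class OffPolicyImitationLoss(AbstractOffPolicyLoss):
    def loss(self, model, batch):
        # Assumes the batch provides observations and the expert's actions.
        logits = model(batch["observations"])
        ce = nn.functional.cross_entropy(logits, batch["expert_actions"])
        return ce, {"offpolicy_cross_entropy": ce.item()}
```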

Possible issues:

  • Currently we log based on the number of rollout steps. This will not be reasonable if we allow a stage to do purely off-policy updates (one possible way of tracking progress in that case is sketched below).
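One possible way around the logging concern, stated here purely as an assumption and not as what the linked PR does, is to track off-policy progress separately from rollout steps and report on whichever counter the stage actually advances:

```python
# Hypothetical bookkeeping: a purely off-policy stage advances on gradient
# updates drawn from the dataset rather than on environment rollout steps.
class StageProgress:
    def __init__(self) -> None:
        self.rollout_steps = 0      # environment steps from on-policy rollouts
        self.offpolicy_updates = 0  # gradient updates from the off-policy dataset

    def record_rollout(self, num_steps: int) -> None:
        self.rollout_steps += num_steps

    def record_offpolicy_update(self) -> None:
        self.offpolicy_updates += 1

    def logging_step(self) -> int:
        # Fall back to off-policy updates when no rollouts are being collected.
        return self.rollout_steps if self.rollout_steps > 0 else self.offpolicy_updates
```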

Dependencies

None

@Lucaweihs (Collaborator, Author)

@jordis-ai2 @klemenkotar -- it might be good to have some discussion about this issue before starting on it to see what our various needs are (e.g. does ALFRED require off-policy training in a certain way?)

@anikem anikem added this to the 0.1 milestone Aug 12, 2020
@jordis-ai2 (Collaborator) commented Aug 22, 2020

@jordis-ai2 jordis-ai2 linked a pull request Aug 22, 2020 that will close this issue
@jordis-ai2 (Collaborator)

From Luca:
To run this code you'll want to first run

python extensions/rl_babyai/babyai_scripts/download_babyai_expert_demos.py GoToLocal # Downloads the data
python extensions/rl_babyai/babyai_scripts/truncate_expert_demos.py # Makes a small version of the IL dataset you can work with locally

and then

python go_to_local.pure_offpolicy --experiment_base projects/babyai_baselines/experiments

@Lucaweihs (Collaborator, Author)

Lucaweihs commented Aug 22, 2020

Feel free to make improvements as you see them: the current implementation is far from perfect so I would not be unhappy if you changed the API. Also of note: off-policy might be a bit tricky when considering the distributed setting (as you'd presumably need some way of partitioning the off-policy dataset into different chunks to be used by the various processes). Not sure what the best solution for this is.
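For the distributed concern, one standard option (an assumption on my part, not necessarily what the linked PR does) is to shard the off-policy dataset across training processes with PyTorch's DistributedSampler, so each worker draws disjoint batches:

```python
# Sketch: partition an off-policy dataset across distributed workers using
# torch.utils.data.distributed.DistributedSampler (standard PyTorch); the
# actual mechanism in the linked PR may differ.
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler


def make_offpolicy_loader(
    dataset: Dataset, world_size: int, rank: int, batch_size: int
) -> DataLoader:
    # Each of the `world_size` processes sees a disjoint shard of the expert
    # demonstrations, identified by its `rank`.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```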
