Behavior-Guided-Reinforcement-Learning

UPDATE: Some theory and explanations of the algorithms can be found in my blog.

Some time ago I read the paper Learning to Score Behaviors for Guided Policy Optimization and I really liked it. Unfortunately, the official repo contains only a demo and doesn't provide a clean implementation that can be used to run extensive experiments.
I had some exciting ideas for improving on the paper and adding new components to it, so I decided to first implement the paper's methods myself and then continue from there.
As if this wasn't challenging enough, I decided to write the code using JAX (which I knew very little about at the time).
While writing the code, I needed to implement a simple policy gradient algorithm. This reminded me of another long-standing interest of mine: I had always wanted to implement some RL algorithms from scratch, partly to understand them (and all their nitty-gritty implementation details) better and partly to become more confident coding RL algorithms (I made one earlier attempt at this in another repo but didn't continue it). So again, this seemed like the perfect opportunity. I don't want to get too fancy, though; I just need some simple actor-critic methods.

Long story short, I created this repo because I wanted to:

- implement and extend the methods in the aforementioned paper,
- learn JAX, and
- implement some RL algorithms from scratch.

So far, it has been a successful experience. I now know much more about JAX, I have implemented an off-policy version of REINFORCE (an actor-critic method is coming soon), and all of this has also brought me closer to implementing the paper.

A todo list (in no particular order):

- Implement a simple actor-critic method to serve as a baseline (nothing fancy)
- (A bit of a reach, but) Implement PPO
- Use JAX to accelerate wherever possible (currently, experience collection is the bottleneck)
- Implement some more advanced neural networks (I eventually want to have recurrent policies)
- Implement proper logging with TensorBoard (I'm always lazy about logging)
- Find some interesting envs and get some initial results on them (current candidates are MiniGrid and some robotic envs)
- Implement behavior-guided policy gradient (BGPG) and behavior-guided evolution strategies (BGES) (the original plan)
- Extend BGPG with some of my own ideas

Repository files:

- algorithm
- data
- models
- .gitignore
- BGES.py
- BGPG.py
- README.md
- classic-control-test.py
- hyperparams.py
- main.py
- wasserstein.py

conflictednerd/Behavior-Guided-Reinforcement-Learning