Reinforcement learning in Keras

This repo aims to implement various reinforcement learning agents using Keras (tf==2.2.0) and sklearn, for use with OpenAI Gym environments.

Episode play example

Planned agents

General references

Set-up

git clone https://github.com/garethjns/reinforcement-learning-keras
cd reinforcement-learning-keras
pip install -r requirements.txt

Implemented algorithms and environment examples

Deep Q learner

Pong

PongNoFrameskip-v4 with various wrappers.

Episode play example Convergence
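The exact wrapper stack is defined in PongConfig; as a rough, hypothetical sketch of the kind of Atari preprocessing typically applied (using gym's built-in wrappers, which may differ from the repo's own):

import gym

# Hypothetical example of typical Atari preprocessing; PongConfig builds its
# own wrapper stack internally.
env = gym.make("PongNoFrameskip-v4")
env = gym.wrappers.AtariPreprocessing(env, frame_skip=4, grayscale_obs=True, scale_obs=True)
env = gym.wrappers.FrameStack(env, num_stack=4)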

Model:
State -> action model -> [value for action 1, value for action 2]

A deep Q learning agent that uses a small neural network to approximate Q(s, a). It includes a replay buffer that allows for batched training updates, which is important for two reasons:

  • As this method is off-policy (the learning target uses argmax over the action values), it can train on data collected during previous episodes. Sampling from the buffer also reduces correlation in the training data.
  • Batched updates are important for performance, especially when using a GPU: calling multiple predict/train operations on single rows inside a loop is very inefficient.

This agent uses two copies of its model:

  • One to predict the value of the next action, which is updated every episode step (with a batch sampled from the replay buffer)
  • One to predict the value of the actions in the current and next state when calculating the discounted reward. This model is updated with the weights from the first model at the end of each episode.
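As a minimal sketch of this two-model pattern (the architecture below is a placeholder, not the agent's actual network):

from tensorflow import keras

def build_model(n_inputs, n_actions):
    # Placeholder architecture; the real network is defined by the agent's config.
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(n_inputs,)),
        keras.layers.Dense(n_actions, activation='linear'),
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

online_model = build_model(n_inputs=4, n_actions=2)  # Trained every step on replay-buffer batches
target_model = build_model(n_inputs=4, n_actions=2)  # Used when calculating discounted-reward targets

# At the end of each episode, copy the online weights into the target model.
target_model.set_weights(online_model.get_weights())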

Run example

from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.atari.pong.pong_config import PongConfig

VirtualGPU(4096) 
agent = DeepQAgent(**PongConfig('dqn').build())
agent.train(verbose=True, render=True, max_episode_steps=10000)

Cart-pole

Using CartPole-v0 with the step limit increased from 200 to 500.
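The longer limit is handled by the config; a minimal sketch of raising the limit with gym's standard TimeLimit wrapper would look something like:

import gym
from gym.wrappers import TimeLimit

# Hypothetical example: strip the default 200-step limit and re-wrap with 500.
env = TimeLimit(gym.make("CartPole-v0").unwrapped, max_episode_steps=500)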

![Episode play example](images/DQNAgent.gif) ![Convergence](images/DQNAgent.png)

Run example

from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig

VirtualGPU(256) 
agent = DeepQAgent(**CartPoleConfig('dqn').build())
agent.train(verbose=True, render=True)

MountainCar (not well tuned)

Episode play example Convergence

Run example

from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.mountain_car.mountain_car_config import MountainCarConfig

VirtualGPU(256)
agent = DeepQAgent(**MountainCarConfig('dqn').build())
agent.train(verbose=True, render=True, max_episode_steps=1500)

Extensions

Dueling DQN

![Episode play example](images/DuelingDQNAgent.gif) ![Convergence](images/DuelingDQNAgent.png)

The dueling version is exactly the same as the DQN, except with a slightly different model architecture. The second-to-last layer is split into two parallel layers, with units=1 and units=n_actions. The idea is that the model might learn the state value V(s) and the action advantages A(s, a) separately, which can speed up convergence.

The output of the network is still action values, but the layers immediately preceding it are no longer fully connected: the two streams estimate V(s) and A(s, a), and a subsequent Keras Lambda layer combines them into the final action values.
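A minimal sketch of a dueling head in Keras, using the standard V(s) + A(s, a) - mean(A) combination (layer sizes here are placeholders, not the repo's exact architecture):

from tensorflow import keras
from tensorflow.keras import backend as K

n_actions = 2
inputs = keras.layers.Input(shape=(4,))
hidden = keras.layers.Dense(64, activation='relu')(inputs)

value = keras.layers.Dense(1)(hidden)               # V(s)
advantages = keras.layers.Dense(n_actions)(hidden)  # A(s, a)

# Combine into action values: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
q_values = keras.layers.Lambda(
    lambda va: va[0] + va[1] - K.mean(va[1], axis=1, keepdims=True)
)([value, advantages])

model = keras.Model(inputs=inputs, outputs=q_values)
model.compile(optimizer='adam', loss='mse')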

Run example

from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig

VirtualGPU(256) 
agent = DeepQAgent(**CartPoleConfig('dueling_dqn').build())
agent.train(verbose=True, render=True)

Linear Q learner

Mountain car

Episode play example Convergence

Model:
State -> model for action 1 -> value for action 1
State -> model for action 2 -> value for action 2

This agent is based on the Lazy Programmer's second reinforcement learning course implementation. It uses a separate SGDRegressor model for each action to estimate Q(a|s). Each step, the model for the selected action is updated using .partial_fit. Action selection is off-policy and epsilon-greedy: the agent selects either the argmax of the action values or a random action, depending on the current value of epsilon.

Environment observations are preprocessed in an sklearn pipeline that clips, scales, and creates features using RBFSampler.
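A rough sketch of this setup, with illustrative feature sizes and pipeline steps rather than the repo's exact configuration:

import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative preprocessing: scale observations, then create RBF features.
featuriser = make_pipeline(StandardScaler(), RBFSampler(gamma=1.0, n_components=100))
featuriser.fit(np.random.uniform(-1, 1, size=(1000, 2)))  # Fit on sampled observations

# One regressor per action, each estimating Q(s, a) for its own action.
models = [SGDRegressor(learning_rate='constant') for _ in range(3)]
for m in models:
    m.partial_fit(featuriser.transform(np.zeros((1, 2))), [0.0])  # Initialise

# Each step, only the selected action's model is updated towards the TD target.
state, action, td_target = np.zeros((1, 2)), 1, 0.5
models[action].partial_fit(featuriser.transform(state), [td_target])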

from rlk.agents.q_learning.linear_q_agent import LinearQAgent
from rlk.environments.mountain_car.mountain_car_config import MountainCarConfig

agent = LinearQAgent(**MountainCarConfig('linear_q').build())
agent.train(verbose=True, render=True, max_episode_steps=1500)

CartPole

Episode play example Convergence

Run example

from rlk.agents.q_learning.linear_q_agent import LinearQAgent
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig 

agent = LinearQAgent(**CartPoleConfig('linear_q').build())
agent.train(verbose=True, render=True)

REINFORCE (policy gradient)

CartPole

![Episode play example](images/REINFORCEAgent.gif) Convergence

Model:
State -> model -> [probability of action 1, probability of action 2]
Refs:
https://github.com/Alexander-H-Liu/Policy-Gradient-and-Actor-Critic-Keras

Policy gradient models move the action selection policy into the model, rather than using argmax(action values). Model outputs are action probabilities rather than values (π(a|s), where π is the policy), making these methods inherently stochastic and removing the need for epsilon greedy action selection.

This agent uses a small neural network to predict action probabilities given a state. Updates are done in a Monte-Carlo fashion, i.e. using all the steps from a single episode. This removes the need for a complex replay buffer (list.append() does the job). However, as the method is on-policy, it requires data collected under the current policy, so training data can't be accumulated across episodes (assuming the policy is updated at the end of each one). As a result, the data in each batch (one episode) is highly correlated, which slows convergence.
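As a minimal illustration of the Monte-Carlo update (not the repo's exact training loop; the gamma value and the normalisation step are assumptions here):

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # Work backwards through the episode, accumulating the discounted return G_t.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Normalising the returns is a common variance-reduction trick.
    return (returns - returns.mean()) / (returns.std() + 1e-8)

# The policy network is then updated to increase log pi(a_t|s_t) weighted by G_t,
# e.g. via categorical cross-entropy with the returns as sample weights:
# model.train_on_batch(states, one_hot_actions, sample_weight=returns)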

This model doesn't use any scaling or clipping for environment pre-processing. For some reason, using the same pre-processing as the DQN models prevents it from converging. The cart-pole environment can potentially return very large values when sampling from the observation space, but these are rarely seen during training. It seems to be fine to pretend they don't exist, rather than scaling inputs based on environment samples, as is done in the other methods.

from rlk.agents.policy_gradient.reinforce_agent import ReinforceAgent
from tf2_vgpu import VirtualGPU
from rlk.environments.cart_pole.cart_pole_config import CartPoleConfig

VirtualGPU(256)
agent = ReinforceAgent(**CartPoleConfig('reinforce').build())
agent.train(verbose=True, render=True)

Doom

Set up

Install these two packages:

Additionally, to save monitor wrapper output, install the following packages:

sudo apt install libcanberra-gtk-module libcanberra-gtk3-module

VizdoomBasic-v0

DQN

Episode play example Convergence

from tf2_vgpu import VirtualGPU
from rlk.agents.q_learning.deep_q_agent import DeepQAgent
from rlk.environments.doom.vizdoom_basic_config import VizDoomBasicConfig

VirtualGPU(256)
agent = DeepQAgent(**VizDoomBasicConfig(agent_type='dqn', mode='stack').build())
agent.train(n_episodes=1000, max_episode_steps=10000, verbose=True, render=True)

VizDoomCorridor-v0

Double dueling DQN

Episode play example Convergence

The DQNs struggle to solve this environment on their own. See the scripts and readme in scripts/doom/ for an example of training with additional experience collected from (scripted) bots.

GFootball

Work in progress. Involves pre-training the agent on historical data, and sampling experience from (policy) bots.

See notes in scripts/gfootball/readme.md
