
A3C-ACER-PyTorch

This repository contains PyTorch implementations of the papers Sample Efficient Actor-Critic with Experience Replay (a.k.a. ACER) and Asynchronous Methods for Deep Reinforcement Learning (a.k.a. A3C).

The A3C paper introduced some key ideas that can be summarized as follows:

  1. Asynchronous updates from multiple parallel agents decorrelate the agent's data into a more stationary process without maintaining an Experience Replay memory. Consequently, the method is no longer restricted to off-policy algorithms, and the memory and computation per real interaction with the environment are reduced.
  2. Architectures that share layers between the policy and the value function, providing better and more efficient representation learning and feature extraction.
  3. An updating scheme that operates on fixed-length segments of experience (say, 5 or 20 time steps), which increases the stationarity of the agent's data. (A minimal sketch of such a shared architecture and fixed-length returns is shown right after this list.)
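
To make points 2 and 3 concrete, here is a minimal PyTorch sketch of a shared-trunk actor-critic network together with a helper that bootstraps discounted returns over a fixed-length segment. The layer sizes, input resolution, and helper names are illustrative assumptions, not the exact architecture used in this repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedActorCritic(nn.Module):
    """Policy and value function sharing a single convolutional trunk."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.Linear(64 * 9 * 9, 256)          # assumes 84x84 Atari frames
        self.policy_head = nn.Linear(256, n_actions)  # logits over discrete actions
        self.value_head = nn.Linear(256, 1)           # state value V(s)

    def forward(self, x):
        h = F.relu(self.fc(self.conv(x / 255.0)))     # normalize pixel observations
        return self.policy_head(h), self.value_head(h)

def k_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns over a fixed-length segment,
    bootstrapped with V(s_{t+k}) at the end of the segment."""
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):   # walk the segment backwards
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))
```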

However, A3C's lack of Experience Replay makes it considerably sample-inefficient: the number of interactions with the environment needed to solve a task is consequently high.

To address this deficiency of A3C, ACER introduces an actor-critic method built on A3C's core structure and adds the benefits of thread-based Experience Replay to improve sample efficiency. More precisely, in the ACER algorithm each parallel agent performs A3C-like on-policy updates on the one hand, and on the other maintains its own Experience Replay buffer to perform off-policy updates.
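
Schematically, each worker interleaves the two kinds of updates. The following sketch shows the idea; `collect_rollout`, `acer_update`, and the replay-buffer API are placeholders rather than this repository's actual functions, and the Poisson-sampled number of replay steps follows the ACER paper:

```python
import numpy as np

def worker_loop(env, local_net, shared_net, replay_buffer,
                rollout_len=20, replay_ratio=4):
    while True:
        # Sync the worker's local network with the shared parameters.
        local_net.load_state_dict(shared_net.state_dict())

        # 1) On-policy update from a fresh fixed-length rollout (A3C-style).
        trajectory = collect_rollout(env, local_net, rollout_len)   # placeholder helper
        acer_update(trajectory, local_net, shared_net, on_policy=True)
        replay_buffer.append(trajectory)        # thread-local replay memory

        # 2) Off-policy updates replayed from this worker's own buffer.
        n_replay = np.random.poisson(replay_ratio)
        for _ in range(n_replay):
            old_trajectory = replay_buffer.sample()
            acer_update(old_trajectory, local_net, shared_net, on_policy=False)
```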

ACER also utilizes more advanced techniques, namely Truncated Importance Sampling with Bias Correction, Stochastic Dueling Network architectures, and Efficient Trust Region Policy Optimization, to further improve stability (a common challenge in policy-gradient methods) and to increase sample efficiency even more.
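
As an illustration of the first of these techniques, below is a minimal PyTorch sketch of the discrete-action policy loss with truncated importance sampling and the bias-correction term. The tensor names, the small epsilon, and the assumption that Retrace targets (`q_ret`) are precomputed are all illustrative:

```python
import torch

def acer_policy_loss(pi, mu, actions, q_values, q_ret, c=10.0):
    """Truncated importance sampling with bias correction (discrete actions).

    pi       : (B, A) current policy probabilities
    mu       : (B, A) behaviour policy probabilities stored in the replay buffer
    actions  : (B,)   actions actually taken
    q_values : (B, A) critic's Q estimates
    q_ret    : (B,)   Retrace targets for the taken actions
    """
    v = (pi * q_values).sum(dim=1)                   # V(x) = E_{a~pi}[Q(x, a)]
    rho = pi.detach() / (mu + 1e-8)                  # per-action importance weights
    rho_a = rho.gather(1, actions.unsqueeze(1)).squeeze(1)

    log_pi = torch.log(pi + 1e-8)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Truncated term: importance weight of the taken action clipped at c.
    truncated = torch.clamp(rho_a, max=c) * log_pi_a * (q_ret - v.detach())

    # Bias-correction term: sums over all actions, active only where rho > c.
    correction_coef = torch.clamp(1.0 - c / rho, min=0.0)   # max(0, (rho - c) / rho)
    correction = (correction_coef * pi.detach() * log_pi *
                  (q_values.detach() - v.detach().unsqueeze(1))).sum(dim=1)

    return -(truncated + correction).mean()          # ascend the gradient -> minimize the negative
```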

This repository contains the discrete-action implementation of ACER here and of A3C here.

Continuous-action implementations are also provided, here for ACER and here for A3C, but they have not been tested yet; they will be added to the current work once they have been suitably debugged and validated.

Results

Number of parallel agents = 8.
The x-axis corresponds to the episode number.

[Figures: ACER's output (gameplay), running episode reward, and running episode length for each environment.]

Comparison

Number of parallel agents = 2.
The x-axis corresponds to the episode number.

[Figures: ACER's output (left) vs. Recurrent A3C's output (right), showing running episode reward and running episode length.]
  • The sample efficiency promised by ACER is evident: the left plot shows a score of ≅21 reached after about 600 episodes, versus roughly 1.7k episodes for the Recurrent A3C on the right.

Table of Hyperparameters

| Parameter | Value |
| --- | --- |
| lr | 1e-4 |
| entropy coefficient | 0.001 |
| gamma | 0.99 |
| k (rollout length) | 20 |
| total memory size (aggregated over all parallel agents' replay buffers) | 6e+5 |
| per-agent replay memory size | 6e+5 // (number of agents * rollout length) |
| c (truncation threshold in truncated importance sampling) | 10 |
| δ (delta used in the trust-region computation) | 1 |
| replay ratio | 4 |
| Polyak averaging coefficient | 0.01 (= 1 - 0.99) |
| critic loss coefficient | 0.5 |
| max grad norm | 40 |
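
As a quick sanity check of the last two memory entries, with 8 parallel agents and a rollout length of 20 the per-agent buffer capacity works out as follows (plain Python; the unit is assumed to be stored rollouts):

```python
total_memory_size = int(6e+5)   # aggregate over all workers
n_agents = 8
rollout_length = 20             # k in the table above

per_agent_memory_size = total_memory_size // (n_agents * rollout_length)
print(per_agent_memory_size)    # 3750 entries per worker's buffer
```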

Dependencies

  • PyYAML == 5.4.1
  • cronyo == 0.4.5
  • gym == 0.17.3
  • numpy == 1.19.2
  • opencv_contrib_python == 4.4.0.44
  • psutil == 5.5.1
  • torch == 1.6.0

Installation

pip3 install -r requirements.txt

Usage

How to Run

usage: main.py [-h] [--env_name ENV_NAME] [--interval INTERVAL] [--do_train]
               [--train_from_scratch] [--seed SEED]

Variable parameters based on the configuration of the machine or user's choice

optional arguments:
  -h, --help            show this help message and exit
  --env_name ENV_NAME   Name of the environment.
  --interval INTERVAL   The interval specifies how often different parameters
                        should be saved and printed, counted by episodes.
  --do_train            The flag determines whether to train the agent or play with it.
  --train_from_scratch  The flag determines whether to train from scratch or continue previous tries.
  --seed SEED           The randomness' seed for torch, numpy, random & gym[env].
  • To train the agent with default arguments, execute the following command with the --do_train flag; otherwise the agent will only be tested. (You may change the environment and random seed as you wish.):
python3 main.py --do_train --env_name="PongNoFrameskip-v4" --interval=200
  • If you want to keep training from your previous run, execute the following (add the --train_from_scratch flag):
python3 main.py --do_train --env_name="PongNoFrameskip-v4" --interval=200 --train_from_scratch
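
For reference, the flags above correspond to a standard argparse setup. The sketch below is consistent with the help text, but the default values and the exact wiring of the boolean flags are assumptions:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Variable parameters based on the configuration of the machine or user's choice")
parser.add_argument("--env_name", type=str, default="PongNoFrameskip-v4",
                    help="Name of the environment.")
parser.add_argument("--interval", type=int, default=100,  # default assumed
                    help="How often parameters are saved and printed, counted by episodes.")
parser.add_argument("--do_train", action="store_true",
                    help="Train the agent; omit the flag to play/test instead.")
parser.add_argument("--train_from_scratch", action="store_false",
                    help="Passing the flag continues a previous run (assumed wiring; see the usage example above).")
parser.add_argument("--seed", type=int, default=123,  # default assumed
                    help="The randomness' seed for torch, numpy, random & gym[env].")

args = parser.parse_args()
```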

Pre-Trained Weights

  • Pre-trained weights of the agents shown playing in the Results section are provided. If you want to test them yourself, do the following:
  1. First, extract your desired weights from the .tar.xz archive to get the .pth file, then rename the env_name + _net_weights.pth file to net_weights.pth. For example: Breakout_net_weights.pth -> net_weights.pth
  2. Create a folder named Models in the root directory of the project and make sure it is empty.
  3. Create another folder with an arbitrary name inside the Models folder. For example:
mkdir Models/ Models/temp_folder
  4. Put your net_weights.pth file in your temp_folder.
  5. Run the above command without the --do_train flag:
python3 main.py --env_name="PongNoFrameskip-v4"

Hardware Requirements

  • All runs with 8 parallel agents were carried out on paperspace.com [Free-GPU, 8 Cores, 30 GB RAM].
  • All runs with 2 parallel agents were carried out on Google Colab [CPU Runtime, 2 Cores, 12 GB RAM].

Tested Environments

  • PongNoFrameskip-v4
  • BreakoutNoFrameskip-v4
  • SpaceInvadersNoFrameskip-v4
  • AssaultNoFrameskip-v4

TODOs

  • Verify and add results of the Continuous version of ACER
  • Verify and add results of the Continuous version of A3C

Structure

.
├── Agent
│   ├── __init__.py
│   ├── memory.py
│   └── worker.py
├── LICENSE
├── main.py
├── NN
│   ├── __init__.py
│   ├── model.py
│   └── shared_optimizer.py
├── Pre-Trained Weights
│   ├── Breakout_net_weights.tar.xz
│   ├── PongACER_net_weights.tar.xz
│   ├── PongRecurrentA3C_net_weights.tar.xz
│   └── SpaceInvaders_net_weights.tar.xz
├── Readme files
│   ├── Gifs
│   │   ├── Breakout.gif
│   │   ├── PongACER.gif
│   │   ├── PongRecurrentA3C.gif
│   │   └── SpaceInvaders.gif
│   └── Plots
│       ├── Breakout_ep_len.png
│       ├── Breakout_reward.png
│       ├── PongACER_ep_len.png
│       ├── PongACER_reward.png
│       ├── PongRecurrentA3C_ep_len.png
│       ├── PongRecurrentA3C_reward.png
│       ├── SpaceInvaders_ep_len.png
│       └── SpaceInvaders_reward.png
├── README.md
├── requirements.txt
├── training_configs.yml
└── Utils
    ├── atari_wrappers.py
    ├── __init__.py
    ├── logger.py
    ├── play.py
    └── utils.py
  1. The Agent package includes the agent's specific components, such as its replay memory and its thread-based (worker) functions.
  2. The NN package includes the neural network's structure and its optimizer settings.
  3. Utils includes helper code that is common to most RL projects and handles auxiliary tasks such as logging and wrapping Atari environments.
  4. Pre-Trained Weights is the directory where the pre-trained weights are stored.
  5. The gifs and plot images used in this README lie in the Readme files directory.

Reference

  1. Sample Efficient Actor-Critic with Experience Replay, Wang et al., 2016
  2. Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., 2016
  3. OpenAI Baselines: ACKTR & A2C

Acknowledgement

The current code was inspired by the following implementations, especially the first one:

  1. acer by @OpenAI
  2. ACER by @Kaixhin
  3. acer by @dchetelat
  4. ACER_tf by @hercky
