# RL Exercise 4 - Asynchronous Advantage Actor Critic

**GOAL:** The goal of this exercise is to demonstrate how to use the asynchronous advantage actor critic (A3C) algorithm.

A3C is described in detail in https://arxiv.org/abs/1602.01783.

In A3C, the driver maintains the most up-to-date policy. It creates a number of actors, which perform partial rollouts and compute gradient updates to the model. The driver runs in a loop: it waits for a single actor task to finish, updates the model with the result of that task, and launches a new actor task with the updated model. Because the actor tasks may finish in any order, the algorithm is fundamentally asynchronous and non-deterministic.
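
The driver loop described above can be sketched without Ray. The following is a minimal illustration, not RLlib's implementation: it assumes a thread pool from the standard library as a stand-in for Ray actors, and a hypothetical `actor_task` that returns a noisy gradient of a toy quadratic loss instead of a policy gradient from a rollout. The names `actor_task` and `run_driver` are made up for this example.

```python
import concurrent.futures as cf

import numpy as np


def actor_task(weights, seed):
    """Hypothetical actor task: noisy gradient of 0.5 * ||w||^2 at `weights`."""
    rng = np.random.default_rng(seed)
    return weights + 0.01 * rng.standard_normal(weights.shape)


def run_driver(num_actors=4, num_updates=200, lr=0.1):
    weights = np.ones(3)  # the driver holds the most up-to-date model
    next_seed = 0
    applied = 0
    with cf.ThreadPoolExecutor(max_workers=num_actors) as pool:
        # Launch one task per actor, each with a snapshot of the weights.
        pending = set()
        for _ in range(num_actors):
            pending.add(pool.submit(actor_task, weights.copy(), next_seed))
            next_seed += 1
        while applied < num_updates:
            # Block until at least one task has finished (several may have).
            done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
            for future in done:
                # Apply the (possibly stale) gradient, then relaunch a task
                # with the updated weights.
                weights = weights - lr * future.result()
                applied += 1
                pending.add(pool.submit(actor_task, weights.copy(), next_seed))
                next_seed += 1
    return weights, applied
```

Each gradient is computed from a snapshot of the weights that may already be out of date when it is applied, which is the source of the non-determinism mentioned above.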

In [1]:
import gym
import ray
from ray.rllib.a3c import A3CAgent, DEFAULT_CONFIG
from ray.rllib.a3c.shared_model import SharedModel



Start up Ray. This must be done before we instantiate any RL agents. We pass in `num_workers=0` because the training agent's constructor will create a number of actors.

In [2]:
ray.init(num_workers=0)

Waiting for redis server at 127.0.0.1:41555 to respond...
Waiting for redis server at 127.0.0.1:24990 to respond...
Starting local scheduler with the following resources: {'CPU': 8, 'GPU': 0}.

View the web UI at http://localhost:8896/notebooks/ray_ui89423.ipynb?token=771da28f2c984ddb9bbac909cb1f94c797451ebd66e209da



{'local_scheduler_socket_names': ['/tmp/scheduler60632477'],
 'node_ip_address': '127.0.0.1',
 'object_store_addresses': [ObjectStoreAddress(name='/tmp/plasma_store56774151', manager_name='/tmp/plasma_manager84457557', manager_port=31862)],
 'redis_address': '127.0.0.1:41555',
 'webui_url': 'http://localhost:8896/notebooks/ray_ui89423.ipynb?token=771da28f2c984ddb9bbac909cb1f94c797451ebd66e209da'}

Instantiate an A3CAgent object. We pass in a config object that specifies how the network and training procedure should be configured. Two of the parameters are described below.

- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used.
- `batch_size` is the number of simulator steps that each actor will batch together.
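
To make the idea of batching simulator steps concrete, here is a minimal sketch of a partial rollout. It is one plausible reading of `batch_size`, not RLlib's code: `ToyEnv` is a hypothetical stand-in for a gym environment (its episodes always end after 10 steps), the policy is uniformly random, and `partial_rollout` is a name invented for this example.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a gym environment; episodes last 10 steps."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        done = self.t >= 10
        return float(self.t), 1.0, done  # observation, reward, done


def partial_rollout(env, obs, batch_size):
    """Collect exactly `batch_size` steps, resetting the env when an episode ends."""
    batch = []
    for _ in range(batch_size):
        action = random.choice([0, 1])  # placeholder for sampling from the policy
        next_obs, reward, done = env.step(action)
        batch.append((obs, action, reward, done))
        obs = env.reset() if done else next_obs
    # Return the current observation so the next rollout can resume from it.
    return batch, obs
```

With `batch_size = 16` and 10-step episodes, one batch spans the end of one episode and the start of the next, which is why the rollouts are called partial.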

In [40]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 16
config['batch_size'] = 16

config_ = config.copy()
agent = A3CAgent(config, 'CartPole-v0')
agent_ = A3CAgent(config_, 'MountainCar-v0')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Observation shape is (4,)
Not using any observation preprocessor.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Constructing fcnet [256, 256] <function tanh at 0x11fbfaf28>
Setting up loss
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Observation shape is (2,)
Not using any observation preprocessor.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Constructing fcnet [256, 256] <function tanh at 0x11fbfaf28>
Setting up loss


**EXERCISE:** Train the agent for some number of steps on the CartPole environment. Compare the performance to PPO from the previous exercise.

In [41]:
for _ in range(100):
    result = agent.train()



**EXERCISE:** Instantiate an A3CAgent object on the `MountainCar-v0` environment and train it for some number of steps. Compare the performance to PPO from the previous exercise.

In [42]:
for _ in range(100):
    result_ = agent_.train()



In [43]:
print(
    result.episode_reward_mean,
    result_.episode_reward_mean,
    sep=72 * '-'+'\n',
)

137.51111111111112------------------------------------------------------------------------
-200.0
