In [1]:
import sys, os
if 'google.colab' in sys.modules and not os.path.exists('.setup_complete'):
    # Install xvfb and our launcher script for it
    !apt-get install -y xvfb
    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/xvfb -O ../xvfb

    # Download dependencies from Github
    !wget https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/week06_policy_based/atari_wrappers.py
    !wget https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/week06_policy_based/env_batch.py
    !wget https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/week06_policy_based/runners.py

    # Update the gym environment to be compatible with the Atari environment
    !pip install -q gymnasium[atari,accept-rom-license]
    !pip install -q tensorboardX

    !touch .setup_complete

# This code creates a virtual display to draw game images on.
# It will have no effect if your machine has a monitor.
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY")) == 0:
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'

bash: ../xvfb: No such file or directory


In [2]:
!pip install -q gymnasium[atari,accept-rom-license]
!pip install -q tensorboardX

zsh:1: no matches found: gymnasium[atari,accept-rom-license]


# Implementing Advantage-Actor Critic (A2C)

In this notebook you will implement Advantage Actor Critic algorithm that trains on a batch of Atari 2600 environments running in parallel.

Firstly, we will use environment wrappers implemented in file `atari_wrappers.py`. These wrappers preprocess observations (resize, grayscale, take max between frames, skip frames and stack them together) and rewards. Some of the wrappers help to reset the environment and pass `done` flag equal to `True` when agent dies.
File `env_batch.py` includes implementation of `ParallelEnvBatch` class that allows to run multiple environments in parallel. To create an environment we can use `nature_dqn_env` function. Note that if you are using
PyTorch and not using `tensorboardX` you will need to implement a wrapper that will log **raw** total rewards that the *unwrapped* environment returns and redefine the implemention of `nature_dqn_env` function here.



In [3]:
import numpy as np
import gymnasium as gym
from atari_wrappers import nature_dqn_env


env_name = "SpaceInvadersNoFrameskip-v4"
nenvs = 8  # change this if you have more than 8 CPU ;)
summaries = "Tensorboard"

env = nature_dqn_env(env_name, nenvs=nenvs, summaries=summaries)
obs, _ = env.reset()
assert obs.shape == (nenvs, 4, 84, 84)
assert obs.dtype == np.float32


A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]


Next, we will need to implement a model that predicts logits and values. It is suggested that you use the same model as in [Nature DQN paper](https://www.nature.com/articles/nature14236) with a modification that instead of having a single output layer, it will have two output layers taking as input the output of the last hidden layer. **Note** that this model is different from the model you used in homework where you implemented DQN. You can use your favorite deep learning framework here. We suggest that you use orthogonal initialization with parameter $\sqrt{2}$ for kernels and initialize biases with zeros.

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

In [5]:
N_ACTIONS, N_FRAMES_STACKED = env.action_space.n, 4

In [6]:
N_ACTIONS

6

In [7]:
class ActorCritic(torch.nn.Module):
    def __init__(self, c_in: int = N_FRAMES_STACKED, n_actions: int = N_ACTIONS, hidden_size: int = 512):
        super(ActorCritic, self).__init__()
        
        self.conv = nn.Sequential(nn.Conv2d(in_channels=N_FRAMES_STACKED, out_channels=32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
        )

        
        self.critic_head = nn.Sequential(
            nn.LazyLinear(hidden_size),
            nn.ReLU(),
            nn.LazyLinear(1),
        )
        
        self.actor_head = nn.Sequential(
            nn.LazyLinear(hidden_size),
            nn.ReLU(),
            nn.LazyLinear(n_actions),
        )

        #self.apply(weights_init)

    def forward(self, x) -> tuple[torch.Tensor]:
        x = torch.FloatTensor(np.array(x))
        x = self.conv(x)
        return self.critic_head(x), self.actor_head(x)

You will also need to define and use a policy that wraps the model. While the model computes logits for all actions, the policy will sample actions and also compute their log probabilities.  `policy.act` should return a dictionary of all the arrays that are needed to interact with an environment and train the model.
 Note that actions must be an `np.ndarray` while the other
tensors need to have the type determined by your deep learning framework.

In [8]:
class Policy:
    def __init__(self, model):
        self.model = model

    def act(self, inputs):
        # Implement a policy by calling the model, sampling actions and computing their log probs.
        # Should return a dict containing keys ['actions', 'logits', 'log_probs', 'values'].
        values, logits = self.model(inputs)
        probs = F.softmax(logits, dim=1)
        res = {}
        res['values'] = values
        res['logits'] = logits
        res['log_probs'] = F.log_softmax(logits, dim=1)
        res['probs'] = probs
        res['actions'] = torch.multinomial(probs, num_samples=1, replacement=True).detach().numpy().reshape(-1)
        return res

Next will pass the environment and policy to a runner that collects partial trajectories from the environment.
The class that does is is already implemented for you.

In [9]:
from runners import EnvRunner

This runner interacts with the environment for a given number of steps and returns a dictionary containing
keys

* 'observations'
* 'rewards'
* 'resets'
* 'actions'
* all other keys that you defined in `Policy`

under each of these keys there is a python `list` of interactions with the environment. This list has length $T$ that is size of partial trajectory. Partial trajectory for given moment `t` is part of `ComputeValueTargets.__call__` input argument `trajectory` from moment `t` to the end (i.e. it's different at each iteration in the algorithm).

To train the part of the model that predicts state values you will need to compute the value targets.
Any callable could be passed to `EnvRunner` to be applied to each partial trajectory after it is collected.
Thus, we can implement and use `ComputeValueTargets` callable.
The formula for the value targets is simple:

$$
\hat v(s_t) = \left( \sum_{t'=0}^{T - 1} \gamma^{t'}r_{t+t'} \right) + \gamma^T \hat{v}(s_{t+T}),
$$

In implementation, however, do not forget to use
`trajectory['resets']` flags to check if you need to add the value targets at the next step when
computing value targets for the current step. You can access `trajectory['state']['latest_observation']`
to get last observations in partial trajectory &mdash; $s_{t+T}$.

In [10]:
class ComputeValueTargets:
    def __init__(self, policy, gamma=0.99):
        self.policy = policy
        self.gamma = gamma

    def __call__(self, trajectory):
        """Compute value targets for a given partial trajectory."""

        # This method should modify trajectory inplace by adding
        # an item with key 'value_targets' to it.
        rewards = torch.tensor(np.array(trajectory["rewards"]))
        done = torch.where(torch.tensor(np.array(trajectory["resets"])) == True, 1, 0)
        value_targets = torch.zeros(rewards.shape)
        
        value_last = (1 - done[-1])*self.policy.act(torch.tensor(trajectory['state']['latest_observation']))['values'].reshape(-1)
        T = len(trajectory["rewards"])
        value_targets[T-1] = value_last
        for i in reversed(range(T-1)):
            value_last = rewards[i] + (1-done[i])*self.gamma*value_last
            value_targets[i] = value_last
        trajectory['value_targets'] = value_targets

After computing value targets we will transform lists of interactions into tensors
with the first dimension `batch_size` which is equal to `env_steps * num_envs`, i.e. you essentially need
to flatten the first two dimensions.

In [11]:
class MergeTimeBatch:
    """ Merges first two axes typically representing time and env batch. """
    def __call__(self, trajectory):
        # Modify trajectory inplace.
        batch_size = runner.nsteps * runner.nenvs
        for key, value in trajectory.items():
            if key not in ['actions', 'logits', 'log_probs', 'probs', 'values', 'value_targets']:
                continue
            if isinstance(value, torch.Tensor):
                trajectory[key] = torch.flatten(value, 0, 1)
            if isinstance(value, list):
                if isinstance(value[0], torch.Tensor): 
                    trajectory[key] = torch.concatenate(value).squeeze()
                if isinstance(value[0], np.ndarray):
                    value = np.stack(value).squeeze()
                    trajectory[key] = np.reshape(value, (batch_size, *value.shape[2:]))

In [12]:
model = ActorCritic()
policy = Policy(model)
runner = EnvRunner(
    env=env,
    policy=policy,
    nsteps=5,
    transforms=[
        ComputeValueTargets(policy),
        MergeTimeBatch(),
    ],
)




In [13]:
trajectory = runner.get_next()
for key, value in trajectory.items():
    print(f"key = {key}", end='\t')
    if isinstance(value, (torch.Tensor, np.ndarray)): 
        print(f"{value.shape}")
    else:
        print(type(value))

key = actions	(40,)
key = values	torch.Size([40])
key = logits	torch.Size([40, 6])
key = log_probs	torch.Size([40, 6])
key = probs	torch.Size([40, 6])
key = observations	<class 'list'>
key = rewards	<class 'list'>
key = resets	<class 'list'>
key = state	<class 'dict'>
key = value_targets	torch.Size([40])


Now is the time to implement the advantage actor critic algorithm itself. You can look into your lecture,
[Mnih et al. 2016](https://arxiv.org/abs/1602.01783) paper, and [lecture](https://www.youtube.com/watch?v=Tol_jw5hWnI&list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37&index=20) by Sergey Levine.

In [14]:
class A2C:
    def __init__(self,
                 policy,
                 optimizer,
                 value_loss_coef=0.25,
                 entropy_coef=0.01,
                 max_grad_norm=0.5):
        self.policy = policy
        self.optimizer = optimizer
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm

    def policy_loss(self, trajectory):
        # You will need to compute advantages here.
        actions = trajectory['actions']
        log_probs = trajectory['log_probs']
        probs = trajectory['probs']
        values = trajectory['values']
        values_target = trajectory['value_targets']
        
        advatages = (values_target - values).detach()
        entropy = -(probs * log_probs).sum(1).mean()
        actions_probs = log_probs[range(len(actions)), actions]
        loss = -(actions_probs * advatages).mean() - self.entropy_coef * entropy
        return loss, entropy, advatages.mean()
        
        
        return policy_loss / len(trajectory['probs'])
    
    def value_loss(self, trajectory):
        return nn.MSELoss()(torch.tensor(trajectory['value_targets']), trajectory['values'])
        

    def loss(self, trajectory):
        loss = self.value_loss_coef*self.value_loss(trajectory) + self.policy_loss(trajectory)[0]
        return loss

    def step(self, trajectory):
        optimizer.zero_grad()
        loss = self.loss(trajectory)
        loss.backward()
        torch.nn.utils.clip_grad_norm(self.policy.model.parameters(), self.max_grad_norm)
        optimizer.step()

Now you can train your model. With reasonable hyperparameters training on a single GTX1080 for 10 million steps across all batched environments (which translates to about 5 hours of wall clock time)
it should be possible to achieve *average raw reward over last 100 episodes* (the average is taken over 100 last
episodes in each environment in the batch) of about 600. You should plot this quantity with respect to
`runner.step_var` &mdash; the number of interactions with all environments. It is highly
encouraged to also provide plots of the following quantities (these are useful for debugging as well):

* [Coefficient of Determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) between
value targets and value predictions
* Entropy of the policy $\pi$
* Value loss
* Policy loss
* Value targets
* Value predictions
* Gradient norm
* Advantages
* A2C loss

For optimization we suggest you use RMSProp with learning rate starting from 7e-4 and linearly decayed to 0, smoothing constant (alpha in PyTorch and decay in TensorFlow) equal to 0.99 and epsilon equal to 1e-5.

In [15]:
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("logs")

In [22]:
#if you use TensorboardSummaries
%load_ext tensorboard
%tensorboard --logdir logs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 60268), started 0:00:02 ago. (Use '!kill 60268' to kill it.)

In [17]:
from atari_wrappers import NumpySummaries
NumpySummaries.clear()

In [18]:
from torch.optim import lr_scheduler

model = ActorCritic()
policy = Policy(model)
runner = EnvRunner(
    env=env,
    policy=policy,
    nsteps=5,
    transforms=[
        ComputeValueTargets(policy),
        MergeTimeBatch(),
    ],
)

optimizer = torch.optim.RMSprop(policy.model.parameters(), lr=7e-4, eps=1e-05)
a2c = A2C(policy, optimizer)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.95)



In [19]:
from tqdm import tqdm

In [20]:
state = env.reset()
step = 0
total_steps = 10 ** 7

with tqdm(range(step, total_steps + 1)) as progress_bar:
    for step in progress_bar:
        # play
        trajectory = runner.get_next()


        #NumpySummaries.set_step(runner.step_var)
        # train
        if step % 100 == 0:
            policy_loss, entropy, advatages = a2c.policy_loss(trajectory)
            value_loss = a2c.value_loss(trajectory)
            loss_ = a2c.loss(trajectory)
            
            env.add_summary('mean_reward', torch.tensor(np.array(trajectory["rewards"])).mean().item())
            env.add_summary('policy_loss', policy_loss.item())
            env.add_summary('value_loss', value_loss.item())
            env.add_summary('loss', loss_.item())
            env.add_summary('entropy', entropy.item())
            env.add_summary('advatages', advatages.item())

            mean_reward = NumpySummaries.get_values("SpaceInvadersNoFrameskip-v4/reward_mean_100")
            policy_loss_history = NumpySummaries.get_values("policy_loss")
            value_loss_history = NumpySummaries.get_values("value_loss")
            loss_history = NumpySummaries.get_values("loss")
            entropy_history = NumpySummaries.get_values("entropy")
            advatages_history = NumpySummaries.get_values("advatages")

        a2c.step(trajectory)
        scheduler.step()
        
#         with torch.no_grad():
#             if step % 1000 == 0:
#                 print(mean_reward)
#                 writer.add_scalar("mean_reward", mean_reward, step)
#                 writer.add_scalar("policy_loss_history", policy_loss_history, step)
#                 writer.add_scalar("value_loss_history", value_loss_history, step)
#                 writer.add_scalar("loss_history", loss_history, step)
#                 writer.add_scalar("entropy_history", entropy_history, step)
#                 writer.add_scalar("advatages_history", advatages_history, step)

  return nn.MSELoss()(torch.tensor(trajectory['value_targets']), trajectory['values'])
  logger.warn(
  torch.nn.utils.clip_grad_norm(self.policy.model.parameters(), self.max_grad_norm)
  5%|█▎                        | 500978/10000001 [12:06:33<191:18:43, 13.79it/s]Process Process-8:
Process Process-7:
Process Process-6:
Process Process-5:
Process Process-4:
Process Process-2:
Process Process-1:
Process Process-3:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Users/dmit-vuk/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/dmit-vuk/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/dmit-vuk/Desktop/RL/Actor Critic/env_batch.py", line 148, in worker
    cmd, data = worker_connecti

  5%|█▎                        | 500978/10000001 [12:06:33<229:36:18, 11.49it/s]


KeyboardInterrupt: 

Получилось выбить награду ~638!

### Target networks?

You may recall a technique called "target networks" we used a few weeks ago when we trained a DQN agent to play Atari Breakout and wonder why we have not suggested using them here. The answer is that this is more historical than practical.

While the "chasing the target" problem is still present in actor-critic value estimation and target networks do show up in follow-up papers, the original A3C/A2C papers do not mention them and do not explain this omission.

The hypothesis why this may not be a big deal (compared to Q-learning) goes like this. An A3C/A2C agent selects actions based on policy, not an epsilon greedy exploration function, for which the argmax can change drastically due to tiny errors in function approximation. Therefore, errors in the value target caused by target chasing will cause less damage.

Also, the actor-critic gradient relies on the advantage function $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. Compare this to the $Q$-function $Q(s_t, a_t) = r(s_t, a_t) + \gamma \cdot \mathbb{E}_{s_{t+1} \mid s_t, a_t} V(s_{t+1})$ used in Q-learning and SARSA: we would expect that any bias in $V$-function approximation will be carried over from $V(s_{t+1})$ to $V(s_t)$ by gradient updates. However, in the formula for the advantage function the two approximations ($Q$-function and $V$-function) come with opposite signs, and thus the errors will cancel out.

The last reason may be computational. Authors were concerned to beat existent algorithms in the wall-clock learning time, and any overhead of parameter copying (target network update) counted against this goal.