<a href="https://colab.research.google.com/github/arinaruck/RL-2021/blob/main/a2c_optional.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import sys
if 'google.colab' in sys.modules:
    import os

    os.system('apt-get install -y xvfb')
    os.system('wget https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/xvfb -O ../xvfb')
    os.system('apt-get install -y python-opengl ffmpeg')
    os.system('pip install pyglet==1.2.4')

    os.system('python -m pip install -U pygame --user')

    print('setup complete')

# XVFB will be launched if you run on a server
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY")) == 0:
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'

setup complete
Starting virtual X frame buffer: Xvfb.


In [None]:
!apt install subversion
!svn checkout https://github.com/yandexdataschool/Practical_RL/trunk/week06_policy_based/

Reading package lists... Done
Building dependency tree       
Reading state information... Done
subversion is already the newest version (1.9.7-4ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.
Checked out revision 4147.


In [None]:
sys.path.append('week06_policy_based')

# Implementing Advantage-Actor Critic (A2C)

In this notebook you will implement the Advantage Actor-Critic (A2C) algorithm, training on a batch of Atari 2600 environments running in parallel.

First, we will use the environment wrappers implemented in `atari_wrappers.py`. These wrappers preprocess observations (resize, convert to grayscale, take the max between frames, skip frames, and stack them together) and rewards. Some of the wrappers help to reset the environment and set the `done` flag to `True` when the agent dies.
The file `env_batch.py` implements the `ParallelEnvBatch` class, which runs multiple environments in parallel. To create an environment we can use the `nature_dqn_env` function. Note that if you are using PyTorch without `tensorboardX`, you will need to implement a wrapper that logs the **raw** total rewards returned by the *unwrapped* environment and redefine the implementation of the `nature_dqn_env` function here.



In [None]:
import numpy as np
from atari_wrappers import nature_dqn_env, NumpySummaries
import torch
import torch.nn as nn

nenvs = 8
env = nature_dqn_env("SpaceInvadersNoFrameskip-v4", nenvs=nenvs, summaries=False)
obs = env.reset()
assert obs.shape == (nenvs, 84, 84, 4)
assert obs.dtype == np.uint8

Next, we will need to implement a model that predicts logits and values. It is suggested that you use the same model as in the [Nature DQN paper](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf), with one modification: instead of a single output layer, it will have two output layers, each taking the output of the last hidden layer as input. **Note** that this model is different from the model you used in the homework where you implemented DQN. You can use your favorite deep learning framework here. We suggest that you use orthogonal initialization with gain $\sqrt{2}$ for kernels and initialize biases with zeros.

In [None]:
def init_weights(m):
    # Orthogonal kernels with gain sqrt(2) and zero biases, for conv and linear layers alike.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
        nn.init.zeros_(m.bias)
        
class CAAgent(nn.Module):
    def __init__(self, n_actions):

        super().__init__()
        self.n_actions = n_actions

        # Define your network body here. Please make sure agent is fully contained here
        # nn.Flatten() can be useful
        hidden = 512
        self.encode = nn.Sequential(
            nn.Conv2d(4, 32, 8, 4), 
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), 
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, 1), 
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, hidden),
            nn.ReLU()
        )
        self.policy = nn.Linear(hidden, n_actions)
        self.v = nn.Sequential(
                    nn.Linear(hidden, hidden // 4),
                    nn.LeakyReLU(0.2),
                    nn.Linear(hidden // 4, 1)
        )
        # Initialize the encoder and both heads, not only the encoder.
        self.apply(init_weights)

    def forward(self, state_t):
        # Accept NumPy arrays or tensors in NHWC layout and convert them to float NCHW.
        state_t = torch.as_tensor(state_t, dtype=torch.float32).permute(0, 3, 1, 2)
        encoded = self.encode(state_t)
        policy = self.policy(encoded)
        values = self.v(encoded)

        return policy, values

You will also need to define and use a policy that wraps the model. While the model computes logits for all actions, the policy samples actions and computes their log probabilities. `policy.act` should return a dictionary of all the arrays needed to interact with the environment and train the model.
Note that actions must be an `np.ndarray`, while the other tensors must have the type used by your deep learning framework.
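
As a framework-free illustration of what `policy.act` has to do for a single observation, the sketch below turns logits into probabilities, samples an action, and looks up its log-probability. The helper names `softmax` and `sample_action` are hypothetical and rely only on the Python standard library; a real implementation would work on batched tensors instead.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng):
    """Sample an action index and return it together with its log-probability."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

rng = random.Random(0)  # seeded only to make the illustration repeatable
action, log_prob = sample_action([2.0, 0.5, -1.0], rng)
print(action, log_prob)
```

Sampling from the distribution (rather than taking the argmax) is what keeps the agent exploring during training.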

In [None]:
class Policy:
    def __init__(self, model):
        self.model = model

    def act(self, inputs):
        logits, values = self.model(inputs)
        probs = nn.functional.softmax(logits, -1)
        log_probs = nn.functional.log_softmax(logits, -1)
        # Sample one action per environment; the environment expects an integer np.ndarray.
        actions = torch.multinomial(probs.detach(), num_samples=1).squeeze(-1).numpy()
        return {'actions': actions, 'logits': logits, 'probs': probs,
                'log_probs': log_probs, 'values': values}

Next, we will pass the environment and policy to a runner that collects partial trajectories from the environment. 
The class that does this is already implemented for you.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from runners import EnvRunner

This runner interacts with the environment for a given number of steps and returns a dictionary containing the keys

* 'observations' 
* 'rewards' 
* 'resets'
* 'actions'
* all other keys that you defined in `Policy`

Under each of these keys there is a Python `list` of interactions with the environment of length $T$, the size of the partial trajectory.

To train the part of the model that predicts state values you will need to compute the value targets. 
Any callable could be passed to `EnvRunner` to be applied to each partial trajectory after it is collected. 
Thus, we can implement and use `ComputeValueTargets` callable. 
The formula for the value targets is simple. With $t = 0, \dots, T - 1$ indexing the steps of a partial trajectory:

$$
\hat v(s_t) = \left( \sum_{t'=0}^{T - 1 - t} \gamma^{t'}r_{t+t'} \right) + \gamma^{T - t} \hat{v}(s_T),
$$

where $\hat{v}(s_T)$ is the model's value estimate for the observation that follows the last step. In the implementation, however, do not forget to use the `trajectory['resets']` flags: when an episode ended at the current step, the value target of the next step must not be added to the target for the current step. You can access `trajectory['state']['latest_observation']`
to get the last observations in the partial trajectory, $s_T$.
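
To make the role of the `resets` flags concrete, here is a minimal sketch for a single environment using plain Python floats. The function name `value_targets` and the toy numbers are illustrative assumptions, not part of the provided code; a batched version would apply the same backward recursion elementwise.

```python
def value_targets(rewards, resets, bootstrap_value, gamma=0.99):
    """Backward recursion: target_t = r_t + gamma * (1 - reset_t) * target_{t+1}."""
    targets = []
    next_target = bootstrap_value  # critic estimate for the state after the last step
    for r, done in zip(reversed(rewards), reversed(resets)):
        if done:
            next_target = 0.0  # the episode ended here, so there is nothing to bootstrap from
        next_target = r + gamma * next_target
        targets.append(next_target)
    return targets[::-1]

# Three steps; the episode ends at the second one.
print(value_targets([1.0, 2.0, 3.0], [False, True, False], bootstrap_value=10.0, gamma=0.5))
# → [2.0, 2.0, 8.0]
```

Only the last step bootstraps from the critic (3 + 0.5 · 10 = 8); the second step sees a reset and keeps just its reward, and the first step discounts the second step's target.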

In [None]:
class ComputeValueTargets:
    def __init__(self, policy, gamma=0.99):
        self.policy = policy
        self.gamma = gamma
    
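    def __call__(self, trajectory):
        # This method modifies the trajectory in place by adding
        # an item with key 'value_targets' to it.
        rewards = trajectory['rewards']
        resets = trajectory['resets']
        # Bootstrap from the critic's estimate for the observation after the last step.
        with torch.no_grad():
            _, last_v = self.policy.model(trajectory['state']['latest_observation'])
        next_target = last_v.squeeze(-1)
        targets = []
        for r, done in zip(reversed(rewards), reversed(resets)):
            r = torch.as_tensor(r, dtype=torch.float32)
            not_done = 1.0 - torch.as_tensor(done, dtype=torch.float32)
            # A reset at this step means the episode ended, so nothing is bootstrapped.
            next_target = r + self.gamma * not_done * next_target
            targets.append(next_target)
        trajectory['value_targets'] = targets[::-1]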

After computing value targets we will transform lists of interactions into tensors
with the first dimension `batch_size` which is equal to `T * nenvs`, i.e. you essentially need
to flatten the first two dimensions. 

In [None]:
class MergeTimeBatch:
    """ Merges first two axes typically representing time and env batch. """
    def __call__(self, trajectory):
        # Modify trajectory in place.
        for k, v in trajectory.items():
            if k == 'state':
                continue  # runner bookkeeping, not a per-step list
            if torch.is_tensor(v[0]):
                v = torch.stack(v)
            else:
                v = torch.as_tensor(np.stack(v))
            # (T, nenvs, ...) -> (T * nenvs, ...)
            trajectory[k] = v.flatten(0, 1)

In [None]:
model = CAAgent(env.action_space.n)
policy = Policy(model)
runner = EnvRunner(
    env, policy, nsteps=5,
    transforms=[
        ComputeValueTargets(policy),
        MergeTimeBatch(),
    ])

Now it is time to implement the advantage actor-critic algorithm itself. You can consult your lecture notes,
the [Mnih et al. 2016](https://arxiv.org/abs/1602.01783) paper, and the [lecture](https://www.youtube.com/watch?v=Tol_jw5hWnI&list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37&index=20) by Sergey Levine.
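
For orientation, one common way to write the objective minimized over a batch of $N$ transitions is sketched below. The symbols $c_v$ and $\beta$ are names chosen here for the value-loss and entropy coefficients, and the exact weighting is a design choice rather than something this notebook prescribes.

$$
L(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \Big[ \log \pi_\theta(a_i \mid s_i)\,\hat A_i + \beta\, H\big(\pi_\theta(\cdot \mid s_i)\big) \Big] + \frac{c_v}{N}\sum_{i=1}^{N} \big(\hat v(s_i) - v_\theta(s_i)\big)^2,
\qquad \hat A_i = \hat v(s_i) - v_\theta(s_i).
$$

Here $\hat v(s_i)$ is the value target, $v_\theta(s_i)$ is the critic's prediction, and the advantage $\hat A_i$ is treated as a constant when differentiating the policy term, so the critic is trained only through the squared-error term.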

In [None]:
class A2C:
    def __init__(self,
                 policy,
                 optimizer,
                 value_loss_coef=0.25,
                 entropy_coef=0.01,
                 max_grad_norm=0.5):
        self.policy = policy
        self.optimizer = optimizer
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm
        self.mse = nn.MSELoss()
    
    def policy_loss(self, trajectory):
        # Tensors stored by Policy.act during the rollout still carry gradients.
        log_probs = trajectory['log_probs']
        values = trajectory['values'].reshape(-1)
        targets = trajectory['value_targets'].reshape(-1)
        actions = trajectory['actions'].reshape(-1).long()
        # The advantage is a constant for the actor: no gradient flows into the critic here.
        advantages = (targets - values).detach()
        entropy = -(log_probs.exp() * log_probs).sum(dim=1)
        log_probs_for_actions = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        return -(log_probs_for_actions * advantages + self.entropy_coef * entropy).mean()

    def value_loss(self, trajectory):
        targets = trajectory['value_targets'].reshape(-1)
        values = trajectory['values'].reshape(-1)
        return self.mse(values, targets)

    def loss(self, trajectory):
        return self.policy_loss(trajectory) + self.value_loss_coef * self.value_loss(trajectory)

    def step(self, trajectory):
        self.optimizer.zero_grad()
        loss = self.loss(trajectory)
        loss.backward()
        # Clip the global gradient norm, then apply the update.
        nn.utils.clip_grad_norm_(self.policy.model.parameters(), self.max_grad_norm)
        self.optimizer.step()

Now you can train your model. With reasonable hyperparameters, training on a single GTX1080 for 10 million steps across all batched environments (about 5 hours of wall-clock time)
should achieve an *average raw reward over the last 100 episodes* (the average is taken over the last 100 
episodes in each environment in the batch) of about 600. You should plot this quantity with respect to 
`runner.step_var`, the number of interactions with all environments. You are strongly 
encouraged to also provide plots of the following quantities (these are useful for debugging as well):

* [Coefficient of Determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) between 
value targets and value predictions
* Entropy of the policy $\pi$
* Value loss
* Policy loss
* Value targets
* Value predictions
* Gradient norm
* Advantages
* A2C loss
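
Two of these diagnostics are easy to get subtly wrong, so here is a minimal standard-library sketch of the coefficient of determination and the policy entropy. The function names `r_squared` and `entropy` are illustrative; in practice you would compute the same quantities on the tensors from the trajectory.

```python
import math

def r_squared(targets, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot  # undefined when all targets are identical (ss_tot == 0)

def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions
print(entropy([0.25, 0.25, 0.25, 0.25]))            # uniform policy over 4 actions
```

A coefficient of determination near 1 means the critic tracks the targets, while values near or below 0 suggest it does no better than predicting the mean. The entropy of a uniform policy over $n$ actions is $\log n$, which gives an upper bound to compare your plot against.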

For optimization we suggest you use RMSProp with a learning rate starting from 7e-4 and linearly decayed to 0, a smoothing constant (alpha in PyTorch and decay in TensorFlow) of 0.99, and epsilon of 1e-5.
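
The linear decay can be expressed as a small schedule function. This is only a sketch: `linear_lr` and `total_updates` are hypothetical names, and how you feed the result to your optimizer depends on the framework (in PyTorch, for instance, by assigning to `group['lr']` for each group in `optimizer.param_groups`).

```python
def linear_lr(update, total_updates, initial_lr=7e-4):
    """Learning rate after `update` of `total_updates` optimizer steps, decayed linearly to 0."""
    fraction_left = max(0.0, 1.0 - update / total_updates)
    return initial_lr * fraction_left

for update in (0, 50, 100):
    print(update, linear_lr(update, total_updates=100))
```

Clamping at zero keeps the rate from going negative if training runs longer than the planned number of updates.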

In [2]:
lr = 7e-4
n_steps = 100
optimizer = torch.optim.RMSprop(model.parameters(), lr, alpha=0.99, eps=1e-5)
a2c = A2C(policy, optimizer)
for i in range(n_steps):
    trajectory = runner.get_next()
    a2c.step(trajectory)
    # Linearly decay the learning rate towards 0 over the run.
    for group in optimizer.param_groups:
        group['lr'] = lr * (1 - (i + 1) / n_steps)
