# Implementing Proximal Policy Optimization 


In this notebook you will implement the Proximal Policy Optimization (PPO) algorithm, 
a scaled-up version of which was used to train [OpenAI Five](https://openai.com/blog/openai-five/) 
to [win](https://openai.com/blog/how-to-train-your-openai-five/) against the
world champions in Dota 2.
You will solve a continuous control environment, on which it may be easier and faster 
to train an agent. Note, however, that PPO may not be the best algorithm here: 
Deep Deterministic Policy Gradient and Soft Actor-Critic, for example, may be better suited 
to continuous control environments. To run the environment you will need to install 
[pybullet-gym](https://github.com/benelot/pybullet-gym), which, unlike MuJoCo, 
does not require you to have a license.

To install the library:

The overall structure of the code is similar to the one in the optional A2C homework, but don't worry if you haven't done it: it should be relatively easy to figure out. 
First, we will create an instance of the environment. 
We will normalize the observations and rewards, but before that you will need a wrapper that 
writes summaries, mainly the total reward during an episode. You can either use the `TensorFlow` one 
implemented in the `atari_wrappers.py` file from the optional A2C homework, or implement your own. 

In [1]:
import gym
import gym_interf
from env_batch import ParallelEnvBatch
import numpy as np
import torch

def make_interf_env(seed):
    env = gym.make('interf-v1')
    env.set_calc_reward('delta_visib')
    #env.set_calc_image('gpu')
    env.seed(seed)
    return env


def make_env(nenvs):
    seed = list(range(nenvs))
    env = ParallelEnvBatch([
        lambda env_seed=env_seed: make_interf_env(seed=env_seed)
        for env_seed in seed
    ])
    return env


N_ENVS = 8
N_STEPS = 20
env = make_env(nenvs=N_ENVS)
obs = env.reset()
#print(obs)
N_ACTIONS = env.action_space.n
OBS_SHAPE = obs.shape
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(DEVICE)
assert obs.shape == (8, 16, 64, 64), obs.shape
assert obs.dtype == np.uint8

cuda



In [2]:
import gym

class ObserwationWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
    
    def observation(self, state):
        return torch.tensor(state).float().to(DEVICE) / 255.0

env = ObserwationWrapper(env)


In [3]:
env.step([0] * N_ENVS)

(tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.],
           ...,
           [0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.]],
 
          [[0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.],
           ...,
           [0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.]],
 
          [[0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.],
           ...,
           [0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.]],
 
          ...,
 
          [[0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.],
           [0., 0., 0.,  ..., 0., 0., 0.],
           ...,
       

The normalization wrapper will subtract the running mean from observations and rewards and divide 
the resulting quantities by the running standard deviations.
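
As a rough illustration of the running statistics such a wrapper needs, here is a minimal NumPy sketch. The class name `RunningMeanStd` and its methods are assumptions made for this example, not part of the notebook's code; the update merges the moments of each new batch into the moments accumulated so far.

```python
import numpy as np


class RunningMeanStd:
    """Tracks a running mean and variance, updated one batch at a time."""

    def __init__(self, shape=()):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 0

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]

        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Merge the moments of the old data and the new batch.
        new_mean = self.mean + delta * batch_count / total
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, x, eps=1e-8):
        # Subtract the running mean and divide by the running standard deviation.
        return (x - self.mean) / np.sqrt(self.var + eps)
```

A wrapper built on this would call `update` on every batch of observations (or rewards) it sees and return `normalize(...)` of them.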

Next, you will need to define a model for training. We suggest that you use two separate networks: one for the policy
and another for the value function. Each network should be a 3-layer MLP with 64 hidden units, $\mathrm{tanh}$ 
activation function, kernel matrices initialized with an orthogonal initializer with gain $\sqrt{2}$,
and biases initialized with zeros. 

Our policy distribution is going to be multivariate normal with diagonal covariance. 
The network from above will predict the mean, and the covariance should be represented by a single 
(learned) vector of size 6 (corresponding to the dimensionality of the action space from above). 
You should initialize this vector to zero and take the exponent of it to always
have a positive quantity. 

Overall the model should return three things: the predicted mean of the distribution, the variance vector, 
and the value function. 
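
To make the role of the exponentiated vector concrete, the following NumPy sketch evaluates and samples a diagonal Gaussian policy. It assumes the learned vector is interpreted as a log standard deviation (`log_std`); the text above leaves open whether it parameterizes the variance or the standard deviation, and the function names are hypothetical.

```python
import numpy as np


def diag_gaussian_log_prob(actions, mean, log_std):
    """Log-density of a diagonal Gaussian with std = exp(log_std)."""
    std = np.exp(log_std)                       # strictly positive by construction
    z = (actions - mean) / std
    per_dim = -0.5 * z ** 2 - log_std - 0.5 * np.log(2.0 * np.pi)
    return per_dim.sum(axis=-1)                 # independent dimensions add up


def diag_gaussian_sample(mean, log_std, rng):
    """Draws one action per row of `mean`."""
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)
```

With `log_std` initialized to zero the standard deviation starts at 1 in every action dimension, which is the starting point the text suggests.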

In [4]:
import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable


def ortho_weights(shape, scale=1.):
    """ PyTorch port of ortho_init from baselines.a2c.utils """
    shape = tuple(shape)

    if len(shape) == 2:
        flat_shape = shape[1], shape[0]
    elif len(shape) == 4:
        flat_shape = (np.prod(shape[1:]), shape[0])
    else:
        raise NotImplementedError

    a = np.random.normal(0., 1., flat_shape)
    u, _, v = np.linalg.svd(a, full_matrices=False)
    q = u if u.shape == flat_shape else v
    q = q.transpose().copy().reshape(shape)

    if len(shape) == 2:
        return torch.from_numpy((scale * q).astype(np.float32))
    if len(shape) == 4:
        return torch.from_numpy((scale * q[:, :shape[1], :shape[2]]).astype(np.float32))


def atari_initializer(module):
    """ Parameter initializer for Atari models
    Initializes Linear, Conv2d, and LSTM weights.
    """
    classname = module.__class__.__name__

    if classname == 'Linear':
        module.weight.data = ortho_weights(module.weight.data.size(), scale=np.sqrt(2.))
        module.bias.data.zero_()

    elif classname == 'Conv2d':
        module.weight.data = ortho_weights(module.weight.data.size(), scale=np.sqrt(2.))
        module.bias.data.zero_()

    elif classname == 'LSTM':
        for name, param in module.named_parameters():
            if 'weight_ih' in name:
                param.data = ortho_weights(param.data.size(), scale=1.)
            if 'weight_hh' in name:
                param.data = ortho_weights(param.data.size(), scale=1.)
            if 'bias' in name:
                param.data.zero_()
                
 
def conv2d_size_out(size, kernel_size, stride):
    """
    common use case:
    cur_layer_img_w = conv2d_size_out(cur_layer_img_w, kernel_size, stride)
    cur_layer_img_h = conv2d_size_out(cur_layer_img_h, kernel_size, stride)
    to understand the shape for dense layer's input
    """
    return (size - (kernel_size - 1) - 1) // stride  + 1


class Flatten(nn.Module):
    def __init__(self):
        super().__init__()
        
    def forward(self, x):
        return x.view(x.size(0), -1)


class AtariCNN(nn.Module):
    def __init__(self, num_actions, state_shape):
        """ Basic convolutional actor-critic network for Atari 2600 games
        Equivalent to the network in the original DQN paper.
        Args:
            num_actions (int): the number of available discrete actions
            state_shape (tuple): shape of a batch of observations,
                [batch_size x channels x width x height]
        """
        super().__init__()

        self.conv = nn.Sequential(nn.Conv2d(in_channels=state_shape[1], out_channels=32, kernel_size=8, stride=4),
                                  nn.ReLU(),
                                  nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
                                  nn.ReLU(),
                                  nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
                                  nn.ReLU(),
                                  Flatten())
        
        convw = conv2d_size_out(state_shape[2], kernel_size=8, stride=4)
        convw = conv2d_size_out(convw, kernel_size=4, stride=2)
        convw = conv2d_size_out(convw, kernel_size=3, stride=1)
        
        convh = conv2d_size_out(state_shape[3], kernel_size=8, stride=4)
        convh = conv2d_size_out(convh, kernel_size=4, stride=2)
        convh = conv2d_size_out(convh, kernel_size=3, stride=1)
       
        linear_input_size = convw * convh * 64
        

        self.fc = nn.Sequential(nn.Linear(linear_input_size, 512),
                                nn.ReLU())

        self.pi = nn.Linear(512, num_actions)
        self.v = nn.Linear(512, 1)
        
        #std = 0
        #self.log_std = nn.Parameter(torch.ones(1, num_actions) * std)

        self.num_actions = num_actions

        # parameter initialization
        self.apply(atari_initializer)
        self.pi.weight.data = ortho_weights(self.pi.weight.size(), scale=.01)
        self.v.weight.data = ortho_weights(self.v.weight.size())

    def forward(self, x):
        """ Module forward pass
        Args:
            x (Tensor): stacked frames, shaped [batch_size x channels x width x height]
                (here [batch_size x 16 x 64 x 64], see `state_shape`)
        Returns:
            mu (Tensor): action logits, shaped [batch_size x self.num_actions]
            0: placeholder for the unused standard deviation (the policy is categorical)
            value (Tensor): value predictions, shaped [batch_size x 1]
        """
        
        conv_out = self.conv(x)
        fc_out = self.fc(conv_out)

        mu = self.pi(fc_out)
        #std = self.log_std.exp().expand_as(mu)
        value = self.v(fc_out)

        return mu, 0, value


In [5]:
model = AtariCNN(N_ACTIONS, OBS_SHAPE).to(DEVICE)
s = env.reset()
print(s.shape)
mu, std, value = model(s)
print(value[0][0])


torch.Size([8, 16, 64, 64])
tensor(-0.0608, device='cuda:0', grad_fn=<SelectBackward>)


This model will be wrapped by a `Policy`. The policy can work in two modes, but in either case 
it is going to return a dictionary with string-type keys. The first mode is when the policy is 
used to sample actions for a trajectory which will later be used for training. In this case 
the flag `training` passed to the `act` method is `False` and the method should return 
a `dict` with the following keys: 

* `"actions"`: actions to pass to the environment
* `"log_probs"`: log-probabilities of sampled actions
* `"values"`: value function $V^\pi(s)$ predictions.

We don't need to use the values under these keys for training, so all of them should be of type `np.ndarray`.

When `training` is `True`, the model is training on a given batch of observations. In this
case it should return a `dict` with the following keys

* `"distribution"`: an instance of multivariate normal distribution (`torch.distributions.MultivariateNormal` or `tf.distributions.MultivariateNormalDiag`)
* `"values"`: value function $V^\pi(s)$ prediction.

The distinction between the modes comes into play depending on where the policy is used: if it is called from `EnvRunner`, 
the `training` flag is `False`; if it is called from `PPO`, the `training` flag is `True`. These classes 
will be described below. 

In [6]:
from torch.nn import functional as F

class Policy:
  def __init__(self, model):
    self.model = model
    
  def act(self, inputs, training = False, determ = False):
    logits, _, values = self.model(inputs)
    dist = torch.distributions.Categorical(logits=logits)
    
    if training:
        return dist, values
        
    else:
        if determ:
            actions = dist.probs.argmax(dim=1, keepdim=True)
        else:
            actions = dist.sample()
        return {
            'actions': actions.detach().cpu().numpy(),
            'log_probs': dist.log_prob(actions).detach(),
            'values': values.detach().cpu().numpy().reshape(-1)
        }


In [7]:
m = AtariCNN(N_ACTIONS, OBS_SHAPE).to(DEVICE)
p = Policy(m)
s = env.reset()
act = p.act(s)
a = act['actions']
lp = act['log_probs']
print('actions', act['actions'].shape)
print('log_probs', lp[0])
#env.step(a)

actions (8,)
log_probs tensor(-2.0787, device='cuda:0')


We will use `EnvRunner` to perform interactions with an environment with a policy for a fixed number of timesteps. Calling `.get_next()` on a runner will return a trajectory: a dictionary 
containing the keys

* `"observations"`
* `"rewards"` 
* `"resets"`
* `"actions"`
* all other keys that you defined in `Policy`.

Under each of these keys there is a `np.ndarray` of specified length $T$, the size of the partial trajectory. 

Additionally, before returning a trajectory this runner can apply a list of transformations. 
Each transformation is simply a callable that should modify passed trajectory in-place.
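
As a small illustration of this protocol, the sketch below defines a hypothetical `ScaleRewards` transform and a helper that applies a list of transforms in order. Neither name comes from the notebook; they only show that a transform is a callable which mutates the trajectory dictionary and returns nothing.

```python
import numpy as np


class ScaleRewards:
    """Hypothetical transform: multiplies rewards by a constant, in place."""

    def __init__(self, scale):
        self.scale = scale

    def __call__(self, trajectory):
        trajectory["rewards"] = np.asarray(trajectory["rewards"]) * self.scale


def apply_transforms(trajectory, transforms):
    """Applies each transform in order; the transforms themselves return nothing."""
    for transform in transforms:
        transform(trajectory)
    return trajectory
```

Because each transform sees the result of the previous one, the order of the list matters: a transform that reads `"advantages"` has to come after the one that writes that key.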

In [8]:
""" RL env runner """
from collections import defaultdict
import numpy as np


class EnvRunner:
  """ Reinforcement learning runner in an environment with given policy """
  def __init__(self, env, policy, nsteps,
               transforms=None, step_var=None):
    self.env = env
    self.policy = policy
    self.nsteps = nsteps
    self.transforms = transforms or []
    self.step_var = step_var if step_var is not None else 0
    self.state = {"latest_observation": self.env.reset()}


  @property
  def nenvs(self):
    """ Returns number of batched envs or `None` if env is not batched """
    return getattr(self.env.unwrapped, "nenvs", None)

  def reset(self):
    """ Resets env and runner states. """
    self.state["latest_observation"] = self.env.reset()
    self.policy.reset()

  def get_next(self):
    """ Runs the agent in the environment.  """
    trajectory = defaultdict(list, {"actions": []})
    observations = []
    rewards = []
    resets = []
    self.state["env_steps"] = self.nsteps

    for i in range(self.nsteps):
      observations.append(self.state["latest_observation"])
      act = self.policy.act(self.state["latest_observation"])
      if "actions" not in act:
        raise ValueError("result of policy.act must contain 'actions' "
                         f"but has keys {list(act.keys())}")
      for key, val in act.items():
        trajectory[key].append(val)

      obs, rew, done, _ = self.env.step(trajectory["actions"][-1])
      self.state["latest_observation"] = obs
      rewards.append(rew)
      resets.append(done)
      self.step_var += self.nenvs or 1

      # Only reset if the env is not batched. Batched envs should auto-reset.
      if not self.nenvs and np.all(done):
        self.state["env_steps"] = i + 1
        self.state["latest_observation"] = self.env.reset()

    trajectory.update(observations=observations, rewards=rewards, resets=resets)
    trajectory["state"] = self.state

    for transform in self.transforms:
      transform(trajectory)
    return trajectory


In [9]:
class AsArray:
  """ 
  Converts lists of interactions to ndarray.
  """
  def __call__(self, trajectory):
    # Modify trajectory inplace. 
    for k, v in filter(lambda kv: kv[0] != "state" and kv[0] != "observations" and kv[0] != "log_probs",
                       trajectory.items()):
      trajectory[k] = np.asarray(v)

You will need to implement the following two transformations. 

The first is `GAE` that implements [Generalized Advantage Estimator](https://arxiv.org/abs/1506.02438).
In it you should add two keys to the trajectory: `"advantages"` and `"value_targets"`. In GAE the advantages
$A_t^{\mathrm{GAE}(\gamma,\lambda)}$ are essentially defined as the exponential 
moving average with parameter $\lambda$ of the regular $n$-step advantages 
$\hat{A}^{(n)}(s_t) = \sum_{l=0}^{n-1} \gamma^l r_{t+l} + \gamma^{n} V^\pi(s_{t+n}) - V^\pi(s_t)$. 
The exact formula for the computation is the following:

$$
A_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{T-1} (\gamma\lambda)^l\delta_{t + l}^V,
$$
where $\delta_{t+l}^V = r_{t+l} + \gamma V^\pi(s_{t+l+1}) - V^\pi(s_{t+l})$. You can look at the 
derivation (formulas 11-16) in the paper. Don't forget to reset the summation on terminal
states as determined by the flags `trajectory["resets"]`. You can use `trajectory["values"]`
to get values of all observations except the most recent which is stored under 
 `trajectory["state"]["latest_observation"]`. For this observation you will need to call the policy 
 to get the value prediction.

Once you computed the advantages, you can get the targets for training the value function by adding 
back values:
$$
\hat{V}(s_{t+l}) = A_{t+l}^{\mathrm{GAE}(\gamma,\lambda)} + V(s_{t + l}),
$$
where $\hat{V}$ is a tensor of value targets that are used to train the value function. 
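
The sum above is usually evaluated with a backward recursion, $A_t = \delta_t + \gamma\lambda A_{t+1}$, restarted at terminal states. The NumPy sketch below, for a single environment, shows that recursion next to a direct evaluation of the sum so the two can be compared; the function names and the `dones` argument (1 where an episode ended) are assumptions made for this example.

```python
import numpy as np


def gae_recursive(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}."""
    values = np.append(values, next_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    # Value targets are the advantages with the value predictions added back.
    return advantages, advantages + values[:-1]


def gae_direct(rewards, values, next_value, gamma=0.99, lam=0.95):
    """Direct evaluation of the sum for a rollout without terminal states."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.append(values, next_value)
    deltas = rewards + gamma * values[1:] - values[:-1]
    T = len(rewards)
    return np.array([
        sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
        for t in range(T)
    ])
```

On a rollout with no terminal states both functions give the same advantages; with a terminal state at step $t$ the recursion reduces to $A_t = r_t - V^\pi(s_t)$ at that step.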

In [10]:
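def compute_gae(next_value, rewards, masks, values, gamma=0.99, tau=0.95):
    """ Computes GAE advantages and value targets.

    `tau` plays the role of lambda; `masks` is 0 where an episode ended
    and 1 otherwise, so the recursion restarts at episode boundaries.
    """
    # Append the bootstrap value so that values[step + 1] is always defined.
    values = np.concatenate((values, [next_value]))
    gae = 0
    gaes = []
    value_targets = []
    for step in reversed(range(len(rewards))):
        delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
        gae = delta + gamma * tau * masks[step] * gae
        gaes.append(gae)
        value_targets.append(gae + values[step])
    # The loop runs backwards in time, so restore chronological order.
    return np.asarray(gaes[::-1]), np.asarray(value_targets[::-1])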


class GAE:
  """ Generalized Advantage Estimator. """
  def __init__(self, policy, gamma=0.99, lambda_=0.95):
    self.policy = policy
    self.gamma = gamma
    self.lambda_ = lambda_
    
  def __call__(self, trajectory):
    next_value = self.policy.act(trajectory['state']['latest_observation'])['values']
   # print(trajectory['values'][0], len(trajectory['values']))
    values = trajectory['values']
    env_steps = trajectory['state']['env_steps']
    rewards = trajectory['rewards']
    dones = trajectory['resets']
    is_not_done = 1.0 - dones
    
    trajectory['advantages'], trajectory['value_targets'] = compute_gae(
        next_value, rewards, is_not_done, values, self.gamma, self.lambda_)



In [11]:
class MergeTimeBatch:
  """ Merges first two axes typically representing time and env batch. """
  def __call__(self, trajectory):
    # Modify trajectory inplace. 
    # Observations and log-probs are lists of tensors (AsArray skips them), so stack them first.
    observations = torch.stack(trajectory['observations'])  # [T, N, C, H, W]
    trajectory['observations'] = observations.reshape(-1, *observations.shape[2:])
    trajectory['log_probs'] = torch.stack(trajectory['log_probs']).reshape(-1)
    # The remaining keys are ndarrays of shape [T, N].
    for key in ('actions', 'value_targets', 'advantages', 'resets', 'rewards'):
      trajectory[key] = trajectory[key].reshape(-1)
    

The main advantage of PPO over simpler policy based methods like A2C is that it is possible
to train on the same trajectory for multiple gradient steps. The following class wraps 
an `EnvRunner`. It should call the runner to get a trajectory, then return minibatches 
from it for a number of epochs, shuffling the data before each epoch.
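
One possible shape for this sampling logic is sketched below with plain NumPy arrays. The function `iterate_minibatches` is a hypothetical stand-in, and it assumes every value in the trajectory is an array whose first axis indexes samples.

```python
import numpy as np


def iterate_minibatches(trajectory, num_epochs, num_minibatches, rng=None):
    """Yields shuffled minibatches; every epoch visits each sample exactly once."""
    if rng is None:
        rng = np.random.default_rng()
    size = len(trajectory["advantages"])
    for _ in range(num_epochs):
        # A fresh permutation per epoch changes which samples share a minibatch.
        permutation = rng.permutation(size)
        for indices in np.array_split(permutation, num_minibatches):
            yield {key: value[indices] for key, value in trajectory.items()}
```

Indexing every key with the same `indices` keeps observations, actions and advantages aligned within a minibatch.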

In [12]:
class TrajectorySampler:
    """ Samples minibatches from trajectory for a number of epochs. """
    def __init__(self, runner, num_epochs, num_minibatches, transforms=None):
        self.runner = runner
        self.num_epochs = num_epochs
        self.num_minibatches = num_minibatches
        self.transforms = transforms or []

    def get_next(self):
        """ Yields the shuffled trajectory once per epoch.  """
        # Collect one trajectory and reuse it for every epoch.
        trajectory = self.runner.get_next()
        for t in self.transforms:
            t(trajectory)
        size = len(trajectory['advantages'])

        for _ in range(self.num_epochs):
            # Reshuffle before each epoch. The whole trajectory is yielded as a
            # single batch; `num_minibatches` is stored but not used here.
            rand_id = np.random.permutation(size)
            tensor_id = torch.as_tensor(rand_id, device=trajectory['observations'].device)

            yield {
                'observations': trajectory['observations'][tensor_id],
                'actions': trajectory['actions'][rand_id],
                'log_probs': trajectory['log_probs'][tensor_id],
                'value_targets': trajectory['value_targets'][rand_id],
                'advantages': trajectory['advantages'][rand_id],
                'rewards': trajectory['rewards'][rand_id],
                'resets': trajectory['resets'][rand_id]
            }

A common trick to use with GAE is to normalize advantages. The following transformation does that. 

In [13]:
class NormalizeAdvantages:
  """ Normalizes advantages to have zero mean and variance 1. """
  def __call__(self, trajectory):
    adv = trajectory["advantages"]
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    trajectory["advantages"] = adv

Finally, we can create our PPO runner. 

In [14]:
def make_ppo_runner(env, policy, num_runner_steps=N_STEPS,
                    gamma=0.99, lambda_=0.95, 
                    num_epochs=1, num_minibatches=32):
    """ Creates runner for PPO algorithm. """
    #num_runner_steps was 2048
    env.reset()

    runner_transforms = [AsArray(),
                         GAE(policy, gamma=gamma, lambda_=lambda_),
                         MergeTimeBatch()]
    runner = EnvRunner(env, policy, num_runner_steps, 
                       transforms=runner_transforms)
  
    sampler_transforms = [NormalizeAdvantages()]
    sampler = TrajectorySampler(runner, num_epochs=num_epochs, 
                                num_minibatches=num_minibatches,
                                transforms=sampler_transforms)
    return sampler

In [15]:
import numpy as np

a = AtariCNN(N_ACTIONS, OBS_SHAPE).to(DEVICE)
p = Policy(a)
r = make_ppo_runner(env, p)
trajectory = r.get_next()


for minibatch in trajectory:
    print(minibatch['observations'][0].type())
    print(minibatch['log_probs'][0])
    p.act(minibatch['observations'])
    print({k: v.shape for k, v in minibatch.items() if k != "state"})


torch.cuda.FloatTensor
tensor(-2.0790, device='cuda:0')
{'observations': torch.Size([160, 16, 64, 64]), 'actions': (160,), 'log_probs': torch.Size([160]), 'value_targets': (160,), 'advantages': (160,), 'rewards': (160,), 'resets': (160,)}


In the next cell you will need to implement the Proximal Policy Optimization algorithm itself. The algorithm
modifies the typical policy gradient loss in the following way:

$$
L_{\pi}(t) = -\frac{\pi_\theta(a_{t}|s_{t})}{\pi_{\theta^{\text{old}}}(a_{t}|s_{t})}
A^{\mathrm{GAE}(\gamma,\lambda)}_{t}\\
L_{\pi}^{\text{clipped}}(t) = -\mathrm{clip}\left(
\frac{\pi_\theta(a_{t}|s_{t})}{\pi_{\theta^{\text{old}}}(a_{t}|s_{t})},
1 - \text{cliprange}, 1 + \text{cliprange}\right)
\cdot A^{\mathrm{GAE}(\gamma, \lambda)}_{t}\\
L_{\text{policy}} = \frac{1}{T}\sum_{l=0}^{T-1}\max\left(L_\pi(t+l), L_{\pi}^{\text{clipped}}(t+l)\right).
$$

Note that only the probability ratio is clipped, not its product with the advantage, and that the maximum is taken
per timestep before averaging.

Additionally, the value loss is modified in the following way:

$$
L_V(t) = \left(V_\theta(s_{t}) - \hat{V}(s_{t})\right)^2\\
L_{V}^{\text{clipped}}(t) = \left(
V_{\theta^{\text{old}}}(s_{t}) +
\text{clip}\left(
V_\theta(s_{t}) - V_{\theta^\text{old}}(s_{t}),
-\text{cliprange}, \text{cliprange}
\right) - \hat{V}(s_{t})\right)^2\\
L_{\text{value}} = \frac{1}{T}\sum_{l=0}^{T-1}\max\left(L_V(t+l), L_V^{\text{clipped}}(t+l)\right).
$$
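
For reference, here is a NumPy sketch of the commonly used per-sample form of both clipped losses: the probability ratio is clipped (not its product with the advantage), and the pessimistic choice is made per sample before averaging. The function names are hypothetical, and gradients are ignored; in the actual implementation the same arithmetic would be done on tensors.

```python
import numpy as np


def ppo_policy_loss(new_log_probs, old_log_probs, advantages, cliprange=0.2):
    """Negated clipped surrogate objective, averaged over samples."""
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - cliprange, 1.0 + cliprange) * advantages
    # Take the smaller objective per sample, then negate to obtain a loss.
    return -np.mean(np.minimum(unclipped, clipped))


def ppo_value_loss(values, old_values, value_targets, cliprange=0.2):
    """Squared error where the new prediction may move at most cliprange from the old one."""
    clipped_values = old_values + np.clip(values - old_values, -cliprange, cliprange)
    unclipped = (values - value_targets) ** 2
    clipped = (clipped_values - value_targets) ** 2
    # Take the larger of the two errors per sample.
    return np.mean(np.maximum(unclipped, clipped))
```

For example, with ratios 1.5 and 0.5, advantages 1 and -1, and `cliprange=0.2`, the per-sample objectives are min(1.5, 1.2) = 1.2 and min(-0.5, -0.8) = -0.8, so the policy loss is -(1.2 - 0.8) / 2 = -0.2.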

In [16]:
class PPO:
  def __init__(self, policy, optimizer,
               cliprange=0.2,
               value_loss_coef=0.25,
               entropy_coef=0.01,
               max_grad_norm=0.5):
    self.policy = policy
    self.optimizer = optimizer
    self.cliprange = cliprange
    self.value_loss_coef = value_loss_coef
    self.entropy_coef=entropy_coef
    self.max_grad_norm = max_grad_norm
    
    self.ploss = None
    self.vloss = None
    self.ppo_loss = None
    
    self.vtargets = None
    self.values = None
    self.advantage = None
    self.entropy = None
    
    self.grad_norm = None
    
  def policy_loss(self, trajectory, dist, value):
    """ Computes and returns policy loss on a given trajectory. """
    advantages = torch.tensor(trajectory['advantages'], dtype=torch.float32).to(DEVICE)
    # Old log-probs are already a detached tensor, so convert them without re-wrapping.
    old_log_probs = trajectory['log_probs'].detach().to(DEVICE, dtype=torch.float32)
    
    a = torch.tensor(trajectory['actions']).to(DEVICE)
    new_log_probs = dist.log_prob(a)
    ratio = (new_log_probs - old_log_probs).exp()
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - self.cliprange, 1.0 + self.cliprange) * advantages
    actor_loss  = - torch.min(surr1, surr2).mean() - self.entropy_coef * dist.entropy().mean()
    
    self.ploss = actor_loss.item()
    self.advantage = torch.mean(advantages).item()
    self.entropy = dist.entropy().mean().item()
    
    return actor_loss
      
  def value_loss(self, trajectory, dist, value):
    """ Computes and returns value loss on a given trajectory. """
    vtargets = torch.tensor(trajectory['value_targets'], dtype=torch.float32).to(DEVICE)
    
    self.vtargets = np.mean(trajectory['value_targets'])
    self.values = torch.mean(value).item()
    
    critic_loss = (vtargets - value).pow(2).mean()
    
    self.vloss = critic_loss.item()
    
    return critic_loss
      
  def loss(self, trajectory):
    dist, value = self.policy.act(trajectory["observations"], training=True)
    total_loss = \
        self.policy_loss(trajectory, dist, value) + \
        self.value_loss_coef * self.value_loss(trajectory, dist,value)
    self.ppo_loss = total_loss.item()
    return total_loss
      
  def step(self, trajectory):
    """ Computes the loss function and performs a single gradient step. """
    self.optimizer.zero_grad()
    self.loss(trajectory).backward()
    # Use the policy stored on this object rather than a global `policy` variable.
    self.grad_norm = nn.utils.clip_grad_norm_(self.policy.model.parameters(), self.max_grad_norm)
    self.optimizer.step()

Now everything is ready for training. Within one million interactions it should be possible to 
achieve a total raw reward of about 1500. You should plot this quantity with respect to 
`runner.step_var`, the number of interactions with the environment. It is highly 
encouraged to also provide plots of the following quantities (these are useful for debugging as well):

* [Coefficient of Determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) between 
value targets and value predictions
* Entropy of the policy $\pi$
* Value loss
* Policy loss
* Value targets
* Value predictions
* Gradient norm
* Advantages
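
The first quantity in the list can be computed directly from a batch of value targets and predictions. A minimal NumPy sketch follows; the function name is an assumption made for this example.

```python
import numpy as np


def coefficient_of_determination(targets, predictions):
    """R^2 = 1 - SS_res / SS_tot: 1 is a perfect fit, 0 matches predicting the mean."""
    targets = np.asarray(targets, dtype=np.float64)
    predictions = np.asarray(predictions, dtype=np.float64)
    ss_res = np.sum((targets - predictions) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    # R^2 is undefined when all targets are identical.
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
```

Values near 1 suggest the value network tracks its targets well, while values near or below 0 suggest it does no better than a constant prediction.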

For optimization it is suggested to use the Adam optimizer with a learning rate linearly annealed 
from 3e-4 to 0 and epsilon 1e-5.
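
The annealing schedule itself is a one-line formula. The sketch below uses a hypothetical helper `linear_lr`, with `total_steps` standing for the planned number of updates.

```python
def linear_lr(step, total_steps, initial_lr=3e-4):
    """Learning rate that decays linearly from `initial_lr` to 0."""
    # Clamp at zero so that steps past `total_steps` do not go negative.
    fraction_left = max(0.0, 1.0 - step / total_steps)
    return initial_lr * fraction_left
```

In PyTorch the resulting value could be written into each entry of `optimizer.param_groups` under the `"lr"` key before every update, or the same schedule could be expressed with `torch.optim.lr_scheduler.LambdaLR`.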

In [17]:
from tqdm import trange

def evaluate(policy, env, n_games=10):
    n_solved = 0
    n_steps = 0
    dist = 0
    angle = 0
    for i in range(n_games):
        s = env.reset()
        istep = 0
        while True:
            s = s.reshape(1, 16, 64, 64)
            action = policy.act(s, determ=True)['actions'][0]
            s, reward, done, info = env.step(action)
            istep += 1
            if done:
                n_solved += istep < 200
                n_steps += istep
                dist += info['dist']
                angle += info['angle_between_beams']
                break
    return n_solved / n_games, n_steps / n_games, dist / n_games, angle / n_games

In [18]:
from tqdm import trange
import numpy as np
from tensorboardX import SummaryWriter

writer = SummaryWriter(
    'runs/ppo_fixed3')

agent = AtariCNN(N_ACTIONS, OBS_SHAPE).to(DEVICE)
opt = torch.optim.Adam(agent.parameters(), lr=3e-4, eps=1e-5)
policy = Policy(agent)

evaluate_env = ObserwationWrapper(make_interf_env(1234))
ppo = PPO(policy, opt)
runner = make_ppo_runner(
    ObserwationWrapper(make_env(nenvs=N_ENVS)), 
    policy)

entropis = []
vlosses = []
plosses = []
vtargets = []
vpredictions = []
grad_norms = []
advantages = []
ppo_losses = []
rewards = np.zeros(N_ENVS, dtype=float)
dones = np.zeros(N_ENVS, dtype=float)
steps = np.zeros(N_ENVS, dtype=float)

eval_solved_games = []
mean_eval_steps = []
n_frames = []


for i in trange(0, int(1e7), N_ENVS * N_STEPS):
    for trajectory in runner.get_next():
        ppo.step(trajectory)
    
        for batch_rewards, batch_dones in zip(trajectory['rewards'], trajectory['resets']):
            rewards += batch_rewards
            dones += batch_dones
            steps += 1

        entropis.append(ppo.entropy)
        vlosses.append(ppo.vloss)
        plosses.append(ppo.ploss)
        vtargets.append(ppo.vtargets)
        vpredictions.append(ppo.values)
        grad_norms.append(ppo.grad_norm)
        advantages.append(ppo.advantage)
        ppo_losses.append(ppo.ppo_loss)

        if np.sum(dones) >= 100:
            writer.add_scalar('entropy', np.mean(entropis), i)
            writer.add_scalar('vloss', np.mean(vlosses), i)
            writer.add_scalar('ploss', np.mean(plosses), i)
            writer.add_scalar('vtarget', np.mean(vtargets), i)
            writer.add_scalar('vprediction', np.mean(vpredictions), i)
            writer.add_scalar('grad_norm', np.mean(grad_norms), i)
            writer.add_scalar('advantage', np.mean(advantages), i)
            writer.add_scalar('loss', np.mean(ppo_losses), i)
            writer.add_scalar('reward', np.mean(rewards / dones), i)
            writer.add_scalar('steps', np.mean(steps / dones), i)

            n_solved, n_steps, dist, angle = evaluate(policy, evaluate_env)

            writer.add_scalar('eval_solved_games', n_solved, i)
            writer.add_scalar('eval_steps', n_steps, i)
            writer.add_scalar('dist_between_beams', dist, i)
            writer.add_scalar('angle_between_beams', angle, i)

            entropis = []
            vlosses = []
            plosses = []
            vtargets = []
            vpredictions = []
            grad_norms = []
            advantages = []
            ppo_losses = []
            rewards = np.zeros(N_ENVS, dtype=float)
            dones = np.zeros(N_ENVS, dtype=float)
            steps = np.zeros(N_ENVS, dtype=float)
        

       

  2%|▏         | 1219/62500 [23:49<5:38:02,  3.02it/s]

KeyboardInterrupt: 