## Vanilla Policy Optimisation

<a href="https://colab.research.google.com/github/EffiSciencesResearch/ML4G/blob/main/days/w1d5/vanilla_policy_gradient.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Preliminary questions:
- Run the script with the default parameters in the terminal.
- Explain what from torch.distributions.categorical import Categorical provides.
- Look up the gym Python library. Why is it useful?
- Is policy gradient model-based or model-free?
- Is policy gradient on-policy or off-policy?
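
As a starting point for the Categorical question, here is a minimal sketch of the three operations the notebook relies on: normalizing logits into probabilities, sampling an action, and evaluating a log-probability. The logit values are made up for illustration.

```python
import torch
from torch.distributions.categorical import Categorical

# Hypothetical logits for a two-action problem such as CartPole.
logits = torch.tensor([1.0, 3.0])
dist = Categorical(logits=logits)

# Categorical normalizes the logits, so probs equals softmax(logits).
probs = dist.probs
print(probs)

# sample() draws an action index according to those probabilities.
action = dist.sample()
print(action.item())

# log_prob(a) evaluates log pi(a); it does not draw a new sample.
log_p = dist.log_prob(torch.tensor(1))
print(log_p.item(), torch.log_softmax(logits, dim=0)[1].item())
```

Note that log_prob needs the action as an argument: it looks up the log-probability of that specific action under the distribution.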

Read all the code, then:
- Complete the ... in the compute_loss function.
- Use https://github.com/patrick-kidger/torchtyping to type the functions get_policy, get_action. You can draw inspiration from the compute_loss function.
- Answer the questions


Don't begin working on this algorithm if you don't understand the blog post: https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html

This exercise is short, but you should aim to understand everything in this code. Simply completing the types is not sufficient. The important thing here is to have a good understanding of each line of code, as well as the policy gradient theorem that we are using.
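
For reference, the simplest form of the policy gradient, as presented in the Spinning Up introduction linked above, can be written as follows. Here $\pi_\theta$ is the policy with parameters $\theta$, $\tau$ is a trajectory sampled by running the policy, and $R(\tau)$ is the total reward of that trajectory:

$$
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
$$

The loss in this notebook is built so that its gradient is a sample estimate of the negative of this expression. It averages over all steps in the batch rather than over trajectories, which for a given batch only rescales the gradient by a constant factor.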

In [1]:
#!pip install torchtyping
#!pip install typeguard==2.13.3
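
The training code in this notebook uses the classic gym interface, where env.reset() returns only the observation and env.step() returns four values. In gym 0.26 and later, and in gymnasium, reset() returns (obs, info) and step() returns (obs, reward, terminated, truncated, info). If the installed version differs from what the notebook expects, a small adapter along the lines of the following sketch may help. The names unpack_reset and unpack_step are hypothetical helpers, not part of either library, and the tuples at the bottom are fake values standing in for a real environment.

```python
def unpack_reset(result):
    """Return only the observation from either reset() convention."""
    # Newer API: reset() returns an (obs, info) tuple with a dict as info.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    # Classic API: reset() returns the observation directly.
    return result


def unpack_step(result):
    """Return (obs, reward, done, info) from either step() convention."""
    if len(result) == 5:
        # Newer API: the done flag is split into terminated and truncated.
        obs, rew, terminated, truncated, info = result
        return obs, rew, terminated or truncated, info
    # Classic API: already (obs, reward, done, info).
    return result


# Fake return values standing in for a real environment.
new_style = unpack_step(([0.1, 0.2], 1.0, False, True, {}))
old_style = unpack_step(([0.1, 0.2], 1.0, False, {}))
first_obs = unpack_reset(([0.0, 0.0], {}))
print(new_style, old_style, first_obs)
```

With such helpers, the calls in the training loop would become obs = unpack_reset(env.reset()) and obs, rew, done, _ = unpack_step(env.step(act)).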

In [1]:
import torch
import torch.nn as nn
from torch.distributions.categorical import Categorical
from torch.optim import Adam
import numpy as np
import gym # by OpenAI, standardized environments for RL
# gym is no longer maintained; gymnasium is its maintained successor
from gym.spaces import Discrete, Box

# torchtyping
from torchtyping import TensorType, patch_typeguard
# typeguard checks the type annotations at runtime
from typeguard import typechecked

patch_typeguard()  # use before @typechecked


def mlp(sizes, activation=nn.Tanh, output_activation=nn.Identity):
    # Build a feedforward neural network. For CartPole with the default
    # hidden_sizes, sizes is [4, 32, 2]: the 4-dimensional state goes in
    # and one logit per action (2 actions) comes out.
    layers = []
    for j in range(len(sizes)-1):
        act = activation if j < len(sizes)-2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j+1]), act()]

    # What does * mean here? Search for unpacking in python
    # nn.Sequential expects each layer as a separate positional argument,
    # so we unpack the list with *
    return nn.Sequential(*layers)

def train(env_name='CartPole-v0', hidden_sizes=[32], lr=1e-2,
          epochs=50, batch_size=5000, render=False):


    # make environment, check spaces, get obs / act dims
    # create environment
    env = gym.make(env_name)
    # gym environments define observations we can get and also the actions
    # we can take
    assert isinstance(env.observation_space, Box), \
        "This example only works for envs with continuous state spaces."
    assert isinstance(env.action_space, Discrete), \
        "This example only works for envs with discrete action spaces."

    # get observation space dimension
    obs_dim = env.observation_space.shape[0]
    # get the number of discrete actions
    n_acts = env.action_space.n

    # Core of policy network
    # What should be the sizes of the layers of the policy network?
    # For CartPole, obs_dim is 4 (the dimension of our states) and
    # n_acts is 2 (two actions to choose from), so the state goes into
    # the input layer and the output layer returns one logit per action
    logits_net = mlp(sizes=[obs_dim]+hidden_sizes+[n_acts])

    # make function to compute action distribution
    # What is the shape of obs?
    # There can be leading batch dimensions, but the last dimension is
    # always obs_dim (4 for CartPole), the dimension of our state space
    @typechecked # To be typed
    def get_policy(obs: TensorType[... , obs_dim]):
        # Our NN, logits_net, is the policy: for the current state it
        # outputs one logit (an unnormalized score) per action.
        # Warning: obs sometimes has a batch dimension, sometimes there is no such dimension
        logits = logits_net(obs)
        # Tip: Categorical is a convenient PyTorch object that stores logits (or a batch of logits)
        # and lets us sample from the resulting probability distribution with the ".sample()" method.
        # It normalizes the logits into action probabilities, like a softmax.
        return Categorical(logits=logits)

    # make action selection function (outputs int actions, sampled from policy)
    # What is the shape of obs?
    # A single state without a batch dimension, so just [obs_dim]
    # (4 for CartPole)
    @typechecked # To be typed
    def get_action(obs: TensorType[obs_dim]):
        # Sample one action from the probability distribution that the
        # policy assigns to the current state; .item() converts the
        # sampled tensor to a Python int
        return get_policy(obs).sample().item()

    # make loss function whose gradient, for the right data, is policy gradient
    # What is the shape of obs?
    # Here we have a batch dimension b (e.g. 77) followed by the
    # dimension of our state space, so the shape is e.g. [77, 4]
    @typechecked
    def compute_loss(obs: TensorType["b", obs_dim], acts: TensorType["b"], rewards: TensorType["b"]):
        """TODO"""
        # obs: the batch of visited states, each of dimension obs_dim
        # acts: the action that was taken in each of those states
        # rewards: a piecewise constant vector containing, for each step,
        # the total reward of the episode that step belongs to

        # Use the get_policy function to get the Categorical object, then
        # evaluate the log-probability of the taken actions with its
        # 'log_prob' method. We pass the actions to log_prob so that it
        # knows which action's log-probability to return for each state.

        # The return plays a role similar to a label: each log-probability
        # is multiplied by the return of its episode.

        # Taking the mean sums over the batch and divides by its size.
        # We want to increase the return, but the optimizer minimizes,
        # so the minus sign turns gradient descent into gradient ascent.
        logp = get_policy(obs).log_prob(acts)
        return -(logp * rewards).mean()

    # make optimizer
    # it updates the parameters of the policy, i.e. of the NN
    optimizer = Adam(logits_net.parameters(), lr=lr)

    # for training policy
    def train_one_epoch():
        # make some empty lists for logging.
        batch_obs = []          # for observations
        batch_acts = []         # for actions
        batch_weights = []      # for R(tau) weighting in policy gradient.
        # One entry per step: the return of the episode that step belongs to
        batch_rets = []         # for measuring episode returns # What is the return?
        # One entry per episode: the sum of that episode's rewards
        batch_lens = []         # for measuring episode lengths

        # reset episode-specific variables
        obs = env.reset()       # first obs comes from starting distribution
        done = False            # signal from environment that episode is over
        # This is now for one trajectory and not for the whole batch
        ep_rews = []            # list for rewards accrued throughout ep

        # render first episode of each epoch
        finished_rendering_this_epoch = False

        # collect experience by acting in the environment with current policy
        while True:

            # rendering
            if (not finished_rendering_this_epoch) and render:
                env.render()

            # save obs
            batch_obs.append(obs.copy())

            # act in the environment
            # get_action expects a tensor, so we convert obs first
            act = get_action(torch.as_tensor(obs, dtype=torch.float32))
            # here we do a step within our environment according to the
            # action sampled from our policy, and thus according to the
            # current parameters of our NN
            obs, rew, done, _ = env.step(act)
            # step returns the new state, the reward, the done flag and
            # an info dict (ignored here)

            # save action, reward
            batch_acts.append(act)
            ep_rews.append(rew)
            # we do this a few times until episode is over
            if done:
                # if episode is over, record info about episode
                # Is the reward discounted?
                ep_ret, ep_len = sum(ep_rews), len(ep_rews)
                batch_rets.append(ep_ret)
                batch_lens.append(ep_len)

                # the weight for each logprob(a|s) is R(tau)
                # Why do we use a constant vector here?
                batch_weights += [ep_ret] * ep_len

                # reset episode-specific variables
                obs, done, ep_rews = env.reset(), False, []

                # won't render again this epoch
                finished_rendering_this_epoch = True

                # end experience loop if we have enough of it
                if len(batch_obs) > batch_size:
                    break

        # take a single policy gradient update step
        # We have now collected one batch of experience
        # Now we want to calculate the gradient for our NN to update the policy
        optimizer.zero_grad()

        # For this we calculate the loss from the batch of states, actions
        # and weights, and then update the parameters of the NN and thus
        # our policy
        batch_loss = compute_loss(obs=torch.as_tensor(batch_obs, dtype=torch.float32),
                                  acts=torch.as_tensor(batch_acts, dtype=torch.int32),
                                  rewards=torch.as_tensor(batch_weights, dtype=torch.float32)
                                  )
        # After calculating the loss, we do the gradient step
        batch_loss.backward()
        optimizer.step()
        return batch_loss, batch_rets, batch_lens

    # training loop
    for i in range(epochs):
        batch_loss, batch_rets, batch_lens = train_one_epoch()
        print('epoch: %3d \t loss: %.3f \t return: %.3f \t ep_len: %.3f'%
              (i, batch_loss, np.mean(batch_rets), np.mean(batch_lens)))
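
The bookkeeping inside train_one_epoch can be illustrated on its own, without an environment. The sketch below uses two made-up episodes (the reward lists are hypothetical, not real CartPole output) and a hypothetical helper build_weights to show how batch_rets gets one entry per episode while batch_weights repeats each episode's return once per step.

```python
def build_weights(episodes):
    """Mimic the per-episode bookkeeping of train_one_epoch.

    `episodes` is a list of reward lists, one list per episode.
    """
    batch_weights, batch_rets, batch_lens = [], [], []
    for ep_rews in episodes:
        ep_ret, ep_len = sum(ep_rews), len(ep_rews)
        batch_rets.append(ep_ret)
        batch_lens.append(ep_len)
        # Every step of an episode is weighted by that episode's return R(tau).
        batch_weights += [ep_ret] * ep_len
    return batch_weights, batch_rets, batch_lens


# Two hypothetical episodes; CartPole gives a reward of 1.0 per step.
episodes = [[1.0, 1.0, 1.0], [1.0, 1.0]]
weights, rets, lens = build_weights(episodes)
print(weights)  # [3.0, 3.0, 3.0, 2.0, 2.0]
print(rets)     # [3.0, 2.0]
print(lens)     # [3, 2]
```

Because CartPole rewards 1.0 per step, the return of an episode equals its length, which is why the return and ep_len columns in the training log are identical.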


In [3]:
train(env_name='CartPole-v0', hidden_sizes=[32], lr=1e-2,
      epochs=50, batch_size=50, render=False)

epoch:   0 	 loss: 10.327 	 return: 14.500 	 ep_len: 14.500
epoch:   1 	 loss: 10.420 	 return: 14.750 	 ep_len: 14.750
epoch:   2 	 loss: 44.778 	 return: 63.000 	 ep_len: 63.000
epoch:   3 	 loss: 23.943 	 return: 32.500 	 ep_len: 32.500
epoch:   4 	 loss: 15.032 	 return: 19.667 	 ep_len: 19.667
epoch:   5 	 loss: 11.608 	 return: 15.500 	 ep_len: 15.500
epoch:   6 	 loss: 26.550 	 return: 29.500 	 ep_len: 29.500
epoch:   7 	 loss: 13.175 	 return: 17.750 	 ep_len: 17.750
epoch:   8 	 loss: 23.596 	 return: 28.000 	 ep_len: 28.000
epoch:   9 	 loss: 25.860 	 return: 36.000 	 ep_len: 36.000
epoch:  10 	 loss: 43.928 	 return: 64.000 	 ep_len: 64.000
epoch:  11 	 loss: 15.997 	 return: 23.000 	 ep_len: 23.000
epoch:  12 	 loss: 19.899 	 return: 25.000 	 ep_len: 25.000
epoch:  13 	 loss: 21.465 	 return: 31.500 	 ep_len: 31.500
epoch:  14 	 loss: 20.834 	 return: 30.500 	 ep_len: 30.500
epoch:  15 	 loss: 20.667 	 return: 27.333 	 ep_len: 27.333
epoch:  16 	 loss: 17.843 	 return: 25.0

In [None]:
# Original algo here: https://github.com/openai/spinningup/blob/master/spinup/algos/pytorch/vpg/vpg.py