# REINFORCE: Vanilla Policy Gradient

In this exercise we want to implement the REINFORCE policy-gradient method to solve a reinforcement learning problem.

To develop the algorithm, we need to:
* define a function approximator (a neural network) that calculates the _policy_ from the observation,
* sample episodes and calculate the returns (or a similar measure),
* calculate the gradient of the expected return (which involves the gradients of the policy),
* apply gradient ascent to change the weights.



In [None]:
!pip install jdc
import numpy as np
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

import matplotlib
import matplotlib.pyplot as plt
import pyglet
import ipywidgets
from IPython import display

%matplotlib inline
import sys

import jdc

## Build the model

Our agent will again use a function to generate the neural network (the model), so that we can call it using different models. For torch we will define this as a class that encapsulates the policy network: it is derived from `nn.Module` and provides a method `act` that calculates the action, which can then be used directly in the agent.

For the loss calculation later, we will need not only the selected action but also the log probability of the selected action, together with its gradient information. The policy network should therefore return both. Implement the `act` method so that it returns the selected action and the log probability of choosing this action. You can use the method `log_prob` from `Categorical`. For the log probability, return the torch tensor instead of its value (i.e. do not use `item`), so that gradients can still be computed from it.
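As a rough illustration of the idea (not the reference solution), the following sketch shows one way such a network could look. The class name `SketchPolicy`, the integer arguments `obs_dim` and `n_actions`, and the hidden size of 64 are assumptions made for this example; the exercise class takes the spaces themselves.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Categorical


class SketchPolicy(nn.Module):
    """Hypothetical stand-in for the exercise's PolicyNetwork."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # The final softmax turns the outputs into action probabilities.
        self.fc = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.fc(x)

    def act(self, obs):
        # Add a batch dimension and convert the observation to a float tensor.
        x = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        dist = Categorical(probs=self.forward(x))
        action = dist.sample()
        # Return the action as a Python int, the log probability as a tensor.
        return action.item(), dist.log_prob(action)


policy_sketch = SketchPolicy(obs_dim=4, n_actions=2)
action, log_prob = policy_sketch.act(np.zeros(4, dtype=np.float32))
```

The returned `log_prob` is still attached to the computation graph, which is what allows the loss to be differentiated with respect to the network weights later.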



In [None]:
# Define the policy network
class PolicyNetwork(torch.nn.Module):
    def __init__(self, observation_space, action_space):
        super(PolicyNetwork, self).__init__()

        # define a function self.fc that contains the network using nn.Sequential
        # YOUR CODE HERE
        raise NotImplementedError()

    def forward(self, x):
        return self.fc(x)

    def act(self, obs):
        # calculate the action and return it
        # YOUR CODE HERE
        raise NotImplementedError()

In [None]:
environment_name = 'CartPole-v1'
env = gym.make(environment_name, render_mode='rgb_array')
policy = PolicyNetwork(env.observation_space, env.action_space)
obs_sample = env.observation_space.sample()
action, log_prob = policy.act(obs_sample)
print(action, log_prob)
action_prob = policy.forward(torch.from_numpy(obs_sample).float().unsqueeze(0))
print(action_prob)
assert action == 0 or action == 1

## Agent class

Now we are ready to implement the agent class. We will start with the class definition and the `__init__` method. Check the parameters and their descriptions, as they will be used in the implementation. There is one additional list, `log_probs`, to save the log probabilities of the actions.


In [None]:
class VPGAgent:
    """
    Implementation of (vanilla) policy gradient agent
    """

    def __init__(self, observation_space, action_space,
                 gamma: float = 0.99,
                 learning_rate: float = 0.001):
        """
        Initialize agent
        Args:
            observation_space: the observation space of the environment
            action_space: the action space of the environment
            gamma: the discount factor
            learning_rate: the learning rate
        """
        self.observation_space = observation_space
        self.action_space = action_space
        self.gamma = gamma
        self.learning_rate = learning_rate

        # generate the model
        self.policy_network = PolicyNetwork(observation_space, action_space)
        self.optimizer = optim.Adam(self.policy_network.parameters(), lr=learning_rate)

        # arrays to store an episode for training
        self.obs = []
        self.rewards = []
        self.actions = []
        self.log_probs = []

### Action

The model directly calculates the policy, so we just have to draw an action from the resulting probability distribution.
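To make the difference between drawing from the distribution and always picking the most likely action concrete, here is a small sketch. The probabilities `[0.8, 0.2]` are made up for illustration and do not come from a trained policy.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Made-up action probabilities, as a policy network might output them.
probs = torch.tensor([0.8, 0.2])
dist = Categorical(probs=probs)

# Drawing many samples: action 0 should appear in roughly 80% of the draws.
samples = dist.sample((10_000,))
fraction_action_0 = (samples == 0).float().mean().item()

# A greedy choice would always return the most likely action instead.
greedy_action = torch.argmax(probs).item()
```

Sampling keeps some exploration in the behaviour, whereas the greedy choice would never try action 1 here.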


In [None]:
%%add_to VPGAgent

def calculate_action(self, obs):
    """
    Calculate the action to take
    Args:
        obs: the observation
    Returns:
        the action to take, the log probability of the action
    """
    return self.policy_network.act(obs)


In [None]:
obs_sample = env.observation_space.sample()
agent = VPGAgent(observation_space=env.observation_space,
                 action_space=env.action_space)

action, log_prob = agent.calculate_action(obs_sample)
assert action == 0 or action == 1

## Step functions and training

Next we will add the step functions and the _training_ inside them. 

### Step first

The `step_first` function just calculates an action and appends the information for our episode (observation, action, log probability). The policy is stochastic, so we should draw the action from the probability distribution.

In [None]:
%%add_to VPGAgent

def step_first(self, obs):
    """
    Calculate the action for the first step in the environment after a reset.
    Args:
        obs: The observation from the environment
    Returns:
        the action
    """
    self.obs.append(obs)
    action, log_prob = self.calculate_action(obs)
    self.actions.append(action)
    self.log_probs.append(log_prob)
    return action

### Step and training


Similar to MC methods, updates only occur at the end of an episode, and each calculated return is used only once for the gradient. However, we do this for each of the actions taken during the episode. Therefore the update batch has the length of one episode.

The weights are updated according to

$$
\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)
$$

where $J(\pi_\theta)$ is the objective function (the expected return), which we want to maximize. In general, the gradient of the objective has the form

$$
\nabla_\theta J(\pi_\theta) = \mathbb{E}\left[\sum_{t=0}^T \Phi_t \nabla_\theta\log\pi_\theta(a_t| s_t) \right]
$$

where there are different choices for $\Phi_t$. We can for example use

$$
\Phi_t = G
$$
where $G$ is the (total) return of the episode, or use the return obtained from each state onwards, sometimes also called the sum of the discounted future rewards:
$$
\Phi_t = \sum_{t'=t}^T \gamma^{t'-t} R_{t'}
$$
It can be proven that, without discounting ($\gamma = 1$), these choices lead to the same expectation of the gradient; with $\gamma < 1$ the discounted sum is a commonly used variant. I suggest using the sum of discounted future rewards.
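The sum of discounted future rewards can be computed in a single backward pass over the episode, using the recursion $\Phi_t = R_t + \gamma \Phi_{t+1}$. The sketch below shows this under the assumption that the rewards are available as a plain list; the function name `discounted_rewards_to_go` is made up for this example.

```python
import numpy as np


def discounted_rewards_to_go(rewards, gamma):
    """Return an array whose entry t is sum_{t' >= t} gamma^(t'-t) * rewards[t']."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Walk backwards so that each entry reuses the value of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three rewards of 1 with gamma = 0.5 give the values 1.75, 1.5 and 1.0.
example = discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)
```

Working backwards avoids recomputing the inner sum for every time step, so the cost is linear in the episode length.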

It can also help to normalize the values of $\Phi_t$ to zero mean and unit standard deviation.
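Since torch optimizers minimize, one common way to perform the gradient ascent step is to minimize the surrogate loss $-\sum_t \Phi_t \log\pi_\theta(a_t|s_t)$, whose gradient is the negative of the sampled policy gradient. The following sketch illustrates this together with the normalization on a toy problem. The two-logit "policy" without observations, the made-up episode, and the use of plain SGD are assumptions chosen to keep the example small; they are not the agent's actual setup.

```python
import torch
from torch.distributions import Categorical


def normalize(values: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shift to zero mean and scale to unit standard deviation."""
    return (values - values.mean()) / (values.std() + eps)


# Toy "policy": two logits and no observation (for illustration only).
theta = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.1)

# A made-up episode: the actions taken and the rewards-to-go that followed.
actions = [0, 0, 1]
phi = normalize(torch.tensor([3.0, 2.0, 1.0]))

dist = Categorical(logits=theta)
log_probs = torch.stack([dist.log_prob(torch.tensor(a)) for a in actions])

# Minimizing the negative weighted sum is gradient ascent on the objective.
loss = -(log_probs * phi).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After this single step the logit of action 0, which was followed by above-average returns, has increased, and the logit of action 1 has decreased. In the agent, the log probabilities stored during the episode would take the place of `log_probs`.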

In [None]:
%%add_to VPGAgent

def step(self, obs, reward: float, done: bool):

    # similar to MC learning, we only update at the end of an episode

    # update the reward from the last time step (so that all arrays should now have the same length)
    self.rewards.append(reward)

    if not done:
        # we have to do the same as in step_first: add the observation and calculate and store an action
        return self.step_first(obs)

    else:
        # an episode is finished, so we calculate the gradient and update the weights
        assert len(self.obs) == len(self.actions)
        assert len(self.obs) == len(self.rewards)
        assert len(self.obs) == len(self.log_probs)

        future_rewards = np.zeros_like(self.rewards)

       
        # YOUR CODE HERE
        raise NotImplementedError()
        
        del self.rewards[:]
        del self.obs[:]
        del self.actions[:]
        del self.log_probs[:]

        return None

In [None]:
env = gym.make(environment_name)
eval_env = gym.make(environment_name)

obs, info = env.reset()
np.random.seed(0)

agent = VPGAgent(env.observation_space, 
                 env.action_space,
                 gamma=0.99,
                 learning_rate=0.001)

# Check if one complete episode runs through
obs, _ = env.reset()
action = agent.step_first(obs)
done = False
truncated = False
while not done and not truncated:
    obs, reward, done, truncated, _ = env.step(action)
    action = agent.step(obs, reward, done or truncated)


### Training and evaluation

We add the `train` and `evaluate` methods to the agent, similar to the last exercise, so that it is easier to run some tests. Nothing to code here. Note that the training length is counted in episodes here, as we only change the weights at the end of each episode.

In [None]:
%%add_to VPGAgent
def train(self, env: gym.Env, 
          nr_episodes_train: int, 
          eval_env: gym.Env, 
          eval_frequency: int, 
          eval_nr_episodes: int,
          eval_gamma: float = 1.0):
    """
    Train the agent on the given environment for the given number of episodes.
    Args:
        env: The environment on which to train the agent
        nr_episodes_train: the number of episodes to train
        eval_env: the environment to use for evaluation
        eval_frequency: Frequency of evaluation of the trained agent (in episodes)
        eval_nr_episodes: The number of episodes to evaluate
        eval_gamma: the discount factor used when evaluating
    """
    nr_episodes = 0
    while True:
        obs, _ = env.reset()
        a = self.step_first(obs)
        done = False
        truncated = False
        while not done and not truncated:
            obs, reward, done, truncated, _ = env.step(a)
            a = self.step(obs, reward, done or truncated)

        nr_episodes += 1
        if nr_episodes % eval_frequency == 0:
            rewards = self.evaluate(eval_env, eval_nr_episodes, eval_gamma)
            print(f'Evaluation: Episode trained {nr_episodes}, mean reward: {np.mean(rewards)}')
        
        if nr_episodes >= nr_episodes_train:
            return

def evaluate(self, env: gym.Env, nr_episodes: int, gamma: float = 1.0):
    """
    Evaluate the agent on the given environment for the given number of episodes.
    Args:
        env: the environment on which to evaluate the agent
        nr_episodes: the number of episodes to evaluate
        gamma: the discount factor applied to the rewards
        
    Returns:
        the rewards for the episodes
    """
    rewards = []
    for e in range(nr_episodes):
        obs, _ = env.reset()
        a,_ = self.calculate_action(obs)
        done = False
        truncated = False
        episode_reward = 0
        gamma_current = 1.0  # the first reward is not discounted
        while not done and not truncated:
            obs, reward, done, truncated, _ = env.step(a)
            a, _ = self.calculate_action(obs)
            episode_reward += gamma_current * reward
            gamma_current *= gamma
        rewards.append(episode_reward)
    return rewards

We train the agent for a number of episodes to test it.

In [None]:
env = gym.make(environment_name)
eval_env = gym.make(environment_name)

obs, info = env.reset()
np.random.seed(0)

agent = VPGAgent(env.observation_space, 
                 env.action_space,
                 gamma=0.99,
                 learning_rate=0.001)

agent.train(env, nr_episodes_train=1000, eval_env=eval_env, eval_frequency=25, eval_nr_episodes=1, eval_gamma=1.0)

# calculate return at end using evaluation
return_eval = agent.evaluate(env=eval_env, nr_episodes=1, gamma=1.0)

print(f'Evaluation: {return_eval}')
  

In [None]:
# training for longer (1000 episodes) should obtain quite good results

In [None]:
def display_environment(env):
    plt.figure(figsize=(6,4))
    plt.imshow(env.render())
    plt.axis('off') 
    display.display(plt.gcf())
    display.clear_output(wait=True)
    plt.close()

env = gym.make(environment_name, render_mode='rgb_array')
obs, _  = env.reset()
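for _ in range(200):
    action, _ = agent.calculate_action(obs)
    obs, _, done, truncated, _ = env.step(action)  # take the action sampled from the policy
    display_environment(env)
    if done or truncated:
        obs, _ = env.reset()  # start a new episode once the current one has ended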

Congratulations, you implemented a full policy gradient algorithm!