# Policy Gradient Methods

Policy _based_ methods learn the optimal policy directly, without necessarily estimating a value
function. Policy _gradient_ methods do that by performing gradient ascent on the objective function.

### Advantages

 * No need to store action-values.
 * Ability to learn a stochastic policy directly.
 * Hence, no need to manually tune exploitation vs. exploration.
 * Effective in continuous action spaces (and high-dimensional state spaces).
 * They generally have good convergence properties.

### Disadvantages

 * They might have high variance.
 * Might converge to a local maximum.
 * Slower than other methods, and might take a long time to train.

In [1]:
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical
from typing import Union

import gymnasium as gym

from util.gymnastics import DEVICE, gym_simulation, init_random, plot_scores

## CartPole Environment

In [None]:
# `gym_simulation` is a convenient utility to run simulations on Gymnasium environments, so that we
# don't have to repeat the same code in all notebooks. It works basically the same as the one we
# implemented for DQN :) Feel free to check the source code!

gym_simulation("CartPole-v1")

In [None]:
# Just for convenience, we hardcode the state and action sizes of the CartPole environment.
STATE_SIZE  = 4
ACTION_SIZE = 2

## Optimization Rule

For one trajectory $\tau$ (or episode), the neural network's weights can be updated according to:

$$
\theta_{k+1} = \theta_k + \alpha \sum_{t=0} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) R(\tau)
$$

We can interpret this as pushing up the probabilities of state-action pairs when the return is
high, and pushing them down when the return is low.

This relationship is also interesting because only the policy function needs to be differentiable:
the reward function might very well be discontinuous and sparse.

For the derivation, check the [Hugging Face Deep RL tutorial](https://huggingface.co/learn/deep-rl-course/unit4/pg-theorem).
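
To make the update rule concrete, here is a minimal plain-Python sketch of a single gradient-ascent step. It assumes a hypothetical state-independent softmax policy whose parameters are just two logits (not the `PolicyNetwork` used in this notebook), for which the gradient of $\log \pi_{\theta}(a)$ with respect to the logits is `one_hot(a) - pi`. The names `softmax`, `grad_log_prob` and `ascent_step` are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def ascent_step(logits, action, ret, alpha=0.1):
    """One gradient-ascent update: theta <- theta + alpha * R * grad log pi(a)."""
    grad = grad_log_prob(logits, action)
    return [z + alpha * ret * g for z, g in zip(logits, grad)]

theta = [0.0, 0.0]  # hypothetical two-action policy, uniform at the start
p_before = softmax(theta)[1]
# Same action taken, once followed by a high return and once by a low return.
p_after_high = softmax(ascent_step(theta, action=1, ret=+10.0))[1]
p_after_low = softmax(ascent_step(theta, action=1, ret=-10.0))[1]
print(p_before, p_after_high, p_after_low)
```

In this toy setting the probability of the sampled action goes up after a high return and down after a low one, which is the interpretation given above.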

## REINFORCE

<div style="width: 70%">
  <img src="assets/04_PG_reinforce.png">
  <br>
  <small>Sutton & Barto 2022</small>
</div>

In [5]:
class PolicyNetwork(nn.Module):
    """The neural network computing the stochastic policy (action probabilities for a state)."""
    def __init__(self, hidden_units=16):
        super(PolicyNetwork, self).__init__()
        # TODO: Create two fully connected / linear layers, input dimension STATE_SIZE, output
        #       dimension ACTION_SIZE, and hidden units specified in the constructor.

    def forward(self, x):
        # TODO: Use ReLU as first non-linearity, and softmax for the output layer (so that we can
        #       interpret this as a stochastic policy across action probabilities). Hint: make sure
        #       to choose the right dimension for the softmax!
        pass

In [6]:
class Agent:
    def __init__(self):
        self.policy = PolicyNetwork()
        self.optimizer = optim.Adam(self.policy.parameters(), lr=1e-2)

    def sample_action(self, state: np.ndarray):
        """Samples an action for a state from the policy network."""
        # TODO: Convert the state to a PyTorch tensor.
        state = None
        # TODO: Get the action probabilities from the policy.
        probs = None
        # TODO: Create a Categorical distribution with those probabilities.
        cdist = None
        # TODO: Sample the action from the categorical distribution.
        action = None
        # TODO: Return the action (hint: use item()), and the log_prob for that action (hint: use
        #       the distribution again!)
        return None

    def learn(self, log_probs: list[torch.Tensor], returns: Union[np.float64, np.ndarray]):
        """Perform a step of learning (gradient ascent) on the policy network."""
        # TODO: For reasons you'll see below, returns can be either a scalar or an array. Let's
        #       create a corresponding tensor first.
        returns = None

        # TODO: Compute the policy loss. Hint: gradient ascent.... so negative sign!
        policy_loss = None

        # TODO: Perform a backprop step via the optimizer.
    
    @torch.no_grad()
    def act(self, state):
        """Convenient method for the agent to select an action during simulation."""
        return self.sample_action(state)[0]

In [7]:
def REINFORCE(env, agent, max_episodes=10_000, max_t=1_000, gamma=1.0):
    # Tracks the score for each episode.
    scores = []
    # Loop for max_episodes.
    for i_episode in range(1, max_episodes + 1):
        # Store episode rewards.
        rewards = []
        # Store episode log probabilities
        log_probs = []
        # Start the episode in the initial state.
        state, _ = env.reset()

        # TODO: Generate an episode following the policy, for T (max_t) timesteps.
        for _ in range(max_t):
            # TODO: sample action and log probability.
            action, log_prob = None
            # TODO: perform a step in the environment.
            state, reward, terminated, truncated, _ = None
            # TODO: store reward and log probability.
            # TODO: Check for episode completion.

        # Compute discounted return.
        # TODO: First, compute the discounts.
        discounts = None
        # TODO: Then compute the total discounted return as the sum of the discounted rewards.
        R = None

        # TODO: Perform a learning step on the agent calling the `learn` method.

        # Track scores and print statistics.
        scores.append(sum(rewards))
        avg_score = np.mean(scores[-100:])
        if i_episode % 100 == 0:
            print(f'Episode {i_episode}\tAverage Score: {avg_score:.2f}')
        if avg_score >= 490.0: # Solved
            print(f'Environment solved at episode {i_episode}\tAverage Score: {avg_score:.2f}')
            break

    return scores

In [8]:
agent = Agent()
with init_random(gym.make('CartPole-v1')) as env:
    scores = REINFORCE(env, agent)
plot_scores(scores)

In [9]:
gym_simulation("CartPole-v1", agent)

## Improvements

### Use Future Rewards

The first thing to notice is that we weight every timestep with the rewards of the whole episode,
including those collected before the action was taken. But an action can only influence _future_
rewards, so those are the only ones we should consider at each timestep.

$$
g = \sum_t R_t^{\text{future}} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)
$$
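
As an illustration, the future reward (or "reward-to-go") at each timestep can be computed with a single backward pass over the episode. The sketch below is plain Python rather than the vectorized NumPy form the exercise hints at, and `rewards_to_go` is a hypothetical helper name. It discounts relative to the current timestep, which coincides with discounting from the episode start when `gamma=1.0` (the default used in this notebook).

```python
from itertools import accumulate

def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every timestep t."""
    # Walk the rewards backwards, carrying the discounted running sum.
    reversed_returns = accumulate(reversed(rewards), lambda g, r: r + gamma * g)
    return list(reversed_returns)[::-1]

print(rewards_to_go([1.0, 1.0, 1.0]))
print(rewards_to_go([1.0, 2.0, 4.0], gamma=0.5))
```

With a reward of 1 per step, as in CartPole, early actions get credit for the whole remaining episode while the last action only gets credit for its own reward.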

### Normalize Rewards

Another technique we can use (especially when collecting multiple trajectories, which will come
later on!) is to normalize the rewards. After normalization roughly half of the actions are
encouraged and the other half discouraged, and the gradient updates stay moderate in size.

$$
R_k \leftarrow \frac{R_k - \mu}{\sigma}
$$
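
A minimal plain-Python sketch of this normalization is shown below. The helper name `normalize` and the small `eps` added to the standard deviation (to avoid dividing by zero when all values are equal) are assumptions for illustration. `statistics.pstdev` computes the population standard deviation, which matches the default behaviour of `np.std`.

```python
import statistics

def normalize(returns, eps=1e-8):
    """Shift to zero mean and scale to (roughly) unit standard deviation."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)  # population std
    return [(r - mean) / (std + eps) for r in returns]

normalized = normalize([1.0, 2.0, 3.0])
print(normalized)
```

Values below the mean become negative and values above it become positive, which is where the "encourage half, discourage half" behaviour comes from.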

### Baseline Subtraction

The idea is to subtract a _baseline_ $b$ from the return, for example the average return across all
trajectories (what if every trajectory _always_ has positive returns?). Actions that do better than
average then have their probabilities pushed up, while actions that do worse than average are
penalized.

$$
\theta_{k+1} = \theta_k + \alpha \sum_{t=0} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) [R(\tau) - b]
$$

We can do this because, in expectation, the extra subtracted term has zero effect on the gradient,
while overall we get reduced variance (proof left as exercise and/or you can find it in the resources).
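
The zero-expectation part of the claim can be checked numerically before attempting the proof. The sketch below assumes a hypothetical softmax policy over three actions, parameterized directly by its logits, and a baseline that does not depend on the action. It computes $\mathbb{E}_{a \sim \pi}[\, b \, \nabla \log \pi(a) \,]$ exactly by summing over all actions. The function names are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_baseline_term(logits, baseline):
    """E_{a ~ pi}[ b * grad log pi(a) ] w.r.t. the logits, computed exactly."""
    probs = softmax(logits)
    n = len(logits)
    expected = [0.0] * n
    for a, p_a in enumerate(probs):
        for i in range(n):
            # d log pi(a) / d logit_i = one_hot(a)[i] - pi[i]
            grad_i = (1.0 if i == a else 0.0) - probs[i]
            expected[i] += p_a * baseline * grad_i
    return expected

print(expected_baseline_term([0.3, -1.2, 2.0], baseline=7.5))
```

Every component comes out as zero up to floating-point error, whatever value is chosen for the baseline. This check says nothing about the variance, which is the other half of the claim.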

#### Advantage Function

This value that we multiply with the log-probability to "reinforce" or "depress" the corresponding
actions is called the _advantage function_ and plays a critical role in state-of-the-art algorithms:

$$
A(*) = R(\tau) - b
$$

It measures how much better the selected action does compared to the _average_ in the state.
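
As a small illustration of why the baseline matters when all returns are positive (as in CartPole), the sketch below uses the mean return as the baseline. The helper name `advantages` and the example returns are made up for illustration.

```python
def advantages(returns):
    """Subtract the mean return (the baseline) from every return."""
    baseline = sum(returns) / len(returns)
    return [r - baseline for r in returns]

# Hypothetical returns of three trajectories: all positive.
print(advantages([10.0, 20.0, 60.0]))
```

Without the baseline all three trajectories would have their actions reinforced. With it, only the trajectory that did better than average is reinforced and the other two are discouraged.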

In [22]:
def REINFORCE_v2(env, agent, max_episodes=10_000, max_t=1_000, gamma=1.0):
    # Tracks the score for each episode.
    scores = []
    # Loop for max_episodes.
    for i_episode in range(1, max_episodes + 1):
        # Store episode rewards.
        rewards = []
        # Store episode log probabilities
        log_probs = []
        # Start the episode in the initial state.
        state, _ = env.reset()

        # TODO: Generate an episode following the policy. Copy the code above :)
        for _ in range(max_t):
            pass

        # Compute discounted _future_ returns.
        # TODO: Compute the discounts and discounted rewards.
        discounted_rewards = None
        # TODO: Compute the future_returns. Hint: consider cumulative sums in reverse order :)
        future_returns = None

        # Baseline.
        # TODO: Use the average of future returns as baseline.
        baseline = None
        # TODO: Subtract the baseline from the future returns.
        future_returns = None

        # Normalization.
        # TODO: normalize the returns computing mean and standard deviation. Hint: make sure the std
        #       is non zero; use np.mean and np.std.
        normalized_returns = None

        # TODO: Perform a learning step calling agent.learn(...)
        # copy() for negative strides :(
        #   https://discuss.pytorch.org/t/negative-strides-in-tensor-error/134287/2

        # Track scores and print statistics
        scores.append(sum(rewards))
        avg_score = np.mean(scores[-100:])
        if i_episode % 100 == 0:
            print(f'Episode {i_episode}\tAverage Score: {avg_score:.2f}')
        if avg_score >= 490.0: # Solved
            print(f'Environment solved at episode {i_episode}\tAverage Score: {avg_score:.2f}')
            break

    return scores

In [None]:
agent_v2 = Agent()
with init_random(gym.make('CartPole-v1')) as env:
    scores_v2 = REINFORCE_v2(env, agent_v2)
plot_scores(scores_v2)

In [None]:
gym_simulation("CartPole-v1", agent_v2)

## What Can We Do Better?

There are other improvements that can be applied:

* To _reduce noise_ on the gradient, we can simply sample multiple different trajectories and learn
  from all of those. Vectorized environments will help with this!
* Actor-critic setup and advanced advantage estimation such as _GAE_ will improve learning.
* We are currently _discarding experiences_ after every learning step. That is because the policy
  effectively changes. But we'll see that with importance sampling we can iterate on the same data
  multiple times and learn in mini-batches!
* Techniques such as "trust region" and "_loss clipping_" will help against degeneration and keep
  the policy learning along smooth gradient directions.

Once we put all of these in place... we'll have PPO!

## Appendix

### Meaning of Loss

Note that the loss function used in policy gradient methods doesn't have the same meaning as in the
typical supervised learning setup. In particular, after the first gradient step there is no more
connection to performance, which is determined by the average return.

The loss function is only useful when evaluated at the current parameters to perform one step of
gradient ascent. After that it loses its meaning, and its value shouldn't be used as a metric for
performance.

More details on [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#id14).
