# 4. Policy Gradient

Welcome to the fourth and final exercise of this series. We will now train an agent that uses a neural network to predict the action probabilities directly, i.e. the policy function $\pi$, instead of the action values. This reduces the number of intermediate steps and makes the training process simpler.

As usual we have to import a few components first.

In [None]:
import gym
import numpy as np
from datetime import datetime
from policy_gradient import PongEnv
from utils import Episode
import matplotlib.pyplot as plt

## Pong

<img src="figures/atari_2600.jpg" width="25%" align="right" /> For this final exercise we will take another step up and use an even more complex environment. Time for some action! We will learn to play the game Pong for the Atari 2600. The OpenAI Gym library has an [Atari section](https://gym.openai.com/envs/#atari) with a lot of those classic games.

As explained in the lectures, the observation space of the default Pong game is too big for this exercise (210,160,3). Training on it would take a lot of time and memory, so we have simplified the observations a little. We have wrapped the `PongNoFrameskip-v4` environment in the class `PongEnv`, which reduces the observation size to (80,80), i.e. black and white, and automatically stacks two frames together so that you can determine the direction of the ball.
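To make the size reduction concrete, here is a minimal NumPy sketch of one common way to shrink a raw frame and stack two of them. It is an assumption for illustration only: the function `preprocess`, the crop rows `35:195`, and the grayscale averaging are hypothetical and not necessarily what `PongEnv` does internally.

```python
import numpy as np

def preprocess(frame):
    """Reduce a raw (210, 160, 3) frame to an (80, 80) float32 image (hypothetical helper)."""
    cropped = frame[35:195]        # assumed crop: keep 160 rows around the playing field
    small = cropped[::2, ::2]      # keep every second pixel in both directions: (80, 80, 3)
    gray = small.mean(axis=2)      # average the colour channels: (80, 80)
    return (gray / 255.0).astype(np.float32)

# Random stand-ins for two consecutive raw frames
raw_prev = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
raw_curr = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)

# Channel 0 holds the current frame, channel 1 the previous one
stacked = np.stack([preprocess(raw_curr), preprocess(raw_prev)], axis=-1)
print(stacked.shape)
```

The stacked array has the same shape as the observations used in the rest of this exercise.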

Let's create it first, and then inspect it.

In [None]:
env = PongEnv()
env.observation_space, env.action_space

As promised, the observation is two frames of 80x80 pixels. In other words, 12800 values. A lot more than the 8 values of the LunarLander environment used in the previous exercise.
 
The action space is still a discrete integer. This time 6 actions are supported, which are:
`NOOP`, `FIRE`, `RIGHT`, `LEFT`, `RIGHTFIRE`, `LEFTFIRE`. In other words, a joystick that can move left or right and a fire button that can be pressed simultaneously.

Let's take a look at those black and white images.

In [None]:
observation = env.reset()
observation, _, _, _ = env.step(4)

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,5))
ax1.imshow(observation[:,:,0], cmap='gray')
ax2.imshow(observation[:,:,1], cmap='gray')

You can see in the left (current) frame that the ball has moved compared to the right (previous) frame, as has our paddle on the right side. That is the whole reason for having two frames: together they show movement. With a single frame it would be impossible to determine which way the ball is going.

## Neural network

Like in the previous exercise we will be using TensorFlow to create a neural network. We will use the same [Keras Functional API](https://www.tensorflow.org/guide/keras/functional) as used in the previous exercise.

So, let's import the necessary TensorFlow components again. (Eager execution is again disabled to increase performance.)

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
# Disable eager execution for performance
tf.compat.v1.disable_eager_execution()

## Prediction model

<img src="figures/pong-net.png" width="50%" align="right"/> Like before we will first build a model that we can use for predicting. This time we will have to predict the action probabilities directly. We will build the neural network as described during the lecture, and depicted on the right.

This means we will first build two [`Conv2D`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D) layers, with the right values for the `filters`, `kernel_size`, `strides`, `padding` and `activation` parameters. You should leave the other parameters at their default values.

Then [`Flatten`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten) the intermediate tensor to a single dimension, so we can use it in the next [`Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layers. The `Flatten` class does not have any parameters except for its name.

Finally, a [`Model`](https://www.tensorflow.org/api_docs/python/tf/keras/models/Model) should be constructed with the right inputs and outputs. Then we have a model that can predict the action probabilities, i.e. the policy.

Be sure to set the `name` parameter on all layers to make it easier to check the summary later on.

In [None]:
from tensorflow.keras.models import load_model
load_model('trained/pong.h5').summary()

In [None]:
def build_model():
    ### START CODE ###
    # Create an Input layer with shape equal to the observation shape of the environment
    observation = ...
    
    # Create the two convolution layers of the depicted network
    # Use a padding parameter equal to 'valid'.
    conv1 = ...
    conv2 = ...
    
    # Create a layer to flatten the output of the convolution layers to a single dimension
    flatten = ...
    
    # Create one hidden dense layer
    dense = ...
    
    # Create the final dense layer that outputs the action probabilities.
    # Be sure to use the 'softmax' activation function!
    action_probs = ...
    
    # Create a model that uses the observation layer as inputs and the action probabilities as outputs.
    model = ...
    ### END CODE ###
    return model

In [None]:
model = build_model()
model.summary()

Your summary should look similar to:

    Model: "pong"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    observation (InputLayer)     [(None, 80, 80, 2)]       0         
    _________________________________________________________________
    conv1 (Conv2D)               (None, 19, 19, 32)        3168      
    _________________________________________________________________
    conv2 (Conv2D)               (None, 4, 4, 64)          100416    
    _________________________________________________________________
    flatten (Flatten)            (None, 1024)              0         
    _________________________________________________________________
    dense (Dense)                (None, 128)               131200    
    _________________________________________________________________
    action_probs (Dense)         (None, 6)                 774       
    =================================================================
    Total params: 235,558
    Trainable params: 235,558
    Non-trainable params: 0
    _________________________________________________________________

Check the number of layers and the number of parameters. These should match exactly. The names may differ.

If this is not correct, then check if you passed the right layers as inputs to the other layers. Did you hook them up in the right order?
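If the numbers do not match, it can help to recompute them by hand. The sketch below reproduces the output shapes and parameter counts of the summary above. The kernel size of 7 and stride of 4 are not taken from the figure; they are inferred from the summary itself, assuming square kernels and equal strides in both directions. The helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride):
    # 'valid' padding adds no border, so windows that do not fit are dropped
    return (size - kernel) // stride + 1

def conv_params(kernel, in_channels, filters):
    # Each filter has kernel * kernel * in_channels weights plus one bias
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    # Each unit has one weight per input plus one bias
    return (inputs + 1) * units

# Inferred from the summary: kernel size 7 and stride 4 for both convolutions
size1 = conv_output_size(80, kernel=7, stride=4)      # width/height after conv1
size2 = conv_output_size(size1, kernel=7, stride=4)   # width/height after conv2
flat = size2 * size2 * 64                             # length after Flatten

counts = [
    conv_params(7, in_channels=2, filters=32),    # conv1
    conv_params(7, in_channels=32, filters=64),   # conv2
    dense_params(flat, 128),                      # dense
    dense_params(128, 6),                         # action_probs
]
print(size1, size2, flat, counts, sum(counts))
# prints: 19 4 1024 [3168, 100416, 131200, 774] 235558
```

A mismatch in one of these numbers usually points directly at the layer whose parameters are off.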

## Train model

The nice thing about this policy gradient method is that we don't need a separate training model. TensorFlow has all the built-in functionality we need to train the model without additional `Lambda` layers. The prediction model predicts probabilities for all the actions. But as explained in the lecture, we can use the cross-entropy loss in combination with the one-hot vectors of the selected actions to obtain the (negative log) probability of only the selected action. In TF this loss function is already registered under the name `'categorical_crossentropy'`, which we can use. (In the previous exercise we used `'mse'` for that.) However, we still need a way to pass along the returns ($G_t$), but we'll come to that later.
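The following NumPy sketch illustrates why the cross-entropy with a one-hot target isolates the selected action. The probabilities are made up for illustration, and the formula is written out by hand rather than calling the Keras loss.

```python
import numpy as np

action_probs = np.array([0.1, 0.2, 0.7])     # hypothetical policy output for 3 actions
one_hot_action = np.array([0.0, 0.0, 1.0])   # action 2 was selected

# Cross-entropy: every term is multiplied by 0 except the selected action
cross_entropy = -np.sum(one_hot_action * np.log(action_probs))

# Same value as the negative log-probability of the selected action alone
selected_only = -np.log(action_probs[2])
print(cross_entropy, selected_only)
```

Both values are identical, so minimizing this loss only pushes on the probability of the action that was actually taken.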

We will also be using an instance of the `Adam` optimizer with the given learning rate.

So, all we need to do is to [`compile`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile) the model with the right parameters.

In [None]:
def compile_model(model, learning_rate):
    ### START CODE ###
    # compile the model with the right loss and optimizer parameter.
    ...
    ### END CODE ###

## Policy

The policy that will be used for interacting with the environment will be a bit different than before. We will no longer use an $\epsilon$-greedy policy. This time we will simply pick actions at random, using the probability distribution $\pi$ that the model outputs.

Let's first take a look at that output. Run this cell to see what the output of the model looks like. Note that the observation is again reshaped to a batch of 1 using the `expand_dims` function.

(Ignore the potential deprecation warnings you can get here. Run the cell another time to get rid of them.)

In [None]:
model = build_model()
observation = env.reset()
action_probs = model.predict_on_batch(np.expand_dims(observation, axis=0))[0]
print(f'pi(s) = {action_probs}')
print(f'sum = {action_probs.sum():.6f}')

You can see that the model returns 6 numbers close to the 0.166667 we expect for a completely random policy. This means the initial weights of the model are such that the input of the softmax is similar for all the actions, and therefore the probability for all actions is similar. If you run the cell multiple times you will see the values change, but not by much.

All 6 probabilities should sum up to 1 (with some floating point arithmetic errors).
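As a rough illustration of this effect, the sketch below applies a hand-written softmax to six small, made-up input values (not actual model activations). When the inputs are close together, the resulting probabilities are all close to 1/6.

```python
import numpy as np

def softmax(logits):
    # Subtracting the maximum keeps the exponentials numerically stable
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Hypothetical softmax inputs, similar to what small initial weights produce
small_logits = np.array([0.01, -0.02, 0.00, 0.03, -0.01, 0.02])
probs = softmax(small_logits)
print(probs, probs.sum())
```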

Now we can implement the policy in the following function.

First we have to predict the action probabilities (the policy) using the model. [`predict_on_batch`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict_on_batch) requires a batch as input, so use [`np.expand_dims`](https://numpy.org/doc/stable/reference/generated/numpy.expand_dims.html) with `axis=0` to make it a batch of 1. Then take the first item of the array that `predict_on_batch` returns to convert the batch of 1 back to a single item.

NumPy has a nice function for randomly selecting a value using a given probability distribution, called [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html). Its first parameter can simply be an integer `n` (e.g. `6` or `len(probs)`), in which case it selects a value from 0 up to and including `n-1`. The parameter `p` can be used to give it a probability distribution, whose length must equal the first parameter.
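To see `np.random.choice` in action outside the exercise, here is a small sketch with a made-up distribution over three values. Drawing many samples shows that the observed frequencies approach the given probabilities.

```python
import numpy as np

np.random.seed(0)                    # fixed seed so the run is repeatable
probs = np.array([0.5, 0.3, 0.2])    # hypothetical distribution over 3 values

# Draw 10000 values from {0, 1, 2} according to probs
samples = np.random.choice(len(probs), size=10000, p=probs)

# Fraction of draws per value
frequencies = np.bincount(samples, minlength=len(probs)) / len(samples)
print(frequencies)
```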

In [None]:
def select_action(model, observation):
    ### START CODE ###
    action_probs = ...
    action = ...
    ### END CODE ###
    return action

## Episode

The algorithm we are using (REINFORCE) is a Monte Carlo or episodic algorithm. That means we only train once the entire episode is finished. So let's make a function that runs a complete episode and stores all required data.

We will use the helper class `Episode` to store the data, which is implemented in the [utils.py](utils.py) file. Take a look if you are interested. Its constructor takes the parameters `capacity`, `observation_shape` and `action_shape`.

For the capacity we will use `2000`. This means we will store at most 2000 observations. This is enough for our case because we won't train the agent long enough for episodes to exceed that length. A buffer of 2000 observations still requires at least `2000*80*80*2*sizeof(float32) = 98 MB` of memory. If we had used the raw observations (2 frames stacked), it would have needed `2000 * 210*160*3 * 2 * sizeof(float32) = 1.5 GB`. With 10+ simultaneous users that would have been a bit too much for this server.
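The memory figures above can be checked with a few lines of plain Python. The sketch assumes 4 bytes per `float32` value and reports the sizes in binary units (MiB and GiB), which is how the rounded numbers above come out.

```python
BYTES_PER_FLOAT32 = 4
capacity = 2000

# Reduced observations: two stacked 80x80 frames
reduced_bytes = capacity * 80 * 80 * 2 * BYTES_PER_FLOAT32
# Raw observations: two stacked 210x160 RGB frames
raw_bytes = capacity * 210 * 160 * 3 * 2 * BYTES_PER_FLOAT32

print(f'reduced: {reduced_bytes / 2**20:.1f} MiB')   # prints: reduced: 97.7 MiB
print(f'raw:     {raw_bytes / 2**30:.2f} GiB')       # prints: raw:     1.50 GiB
```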

Furthermore, it has a function `append` that requires an observation, action and reward, which it will store. The action should be a one-hot vector. You can use the (already imported) function [`to_categorical`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical) for that. Besides the action itself, it only needs the `num_classes` parameter, which should be `6` in this case.
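For reference, this is what a one-hot vector looks like. The sketch builds one with plain NumPy instead of `to_categorical`, purely to show the result; the action index 4 is an arbitrary example.

```python
import numpy as np

num_actions = 6
action = 4   # arbitrary example action

# Row `action` of the identity matrix is the one-hot vector for that action
one_hot = np.eye(num_actions, dtype=np.float32)[action]
print(one_hot)
```

Exactly one entry is 1, at the index of the selected action, and all others are 0.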
 
So implement the following function.

In [None]:
def run_episode(env, model):
    episode = Episode(2000, (80,80,2), (6,))

    ### START CODE ###
    # Reset the environment
    observation = ...
    done = False

    # Loop until the episode is finished
    ...:
        # Select an action for the current observation using the model
        action = ...

        # Take an environment step with that action.
        next_observation, reward, done, _ = ...

        # Convert action to a one-hot vector
        action = ...
        
        # Append the observation, action and reward to the episode
        ...

        # Make the next observation the current observation
        observation = next_observation

    ### END CODE ###
    return episode

Let's see if you implemented it correctly by running one episode and plotting the rewards.

In [None]:
model = build_model()
episode = run_episode(env, model)

observations, actions, rewards = episode.get()

plt.figure(figsize=(15,5))
plt.plot(rewards)

You can clearly see the moments in time at which it receives a reward by the spikes in the graph. The rewards do not all arrive at the same interval; sometimes it takes a bit longer. Most, if not all, of the rewards are -1.

## Expected return

We will not be using a value estimation function in this algorithm, so we have to compute the expected return ($G_t$) ourselves. Remember from the lectures that this is defined recursively as follows.

$$ G_t = R_t + \gamma G_{t+1} $$

That means that if you go backwards from the end of the episode to the front then you can compute it easily. Let's do that in the function we call `discount_rewards`. It receives the rewards array gathered during an episode and the discount rate `gamma`. 
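As a small worked example, assume a hypothetical three-step episode with rewards $R_0 = 0$, $R_1 = 0$, $R_2 = 1$, a discount rate $\gamma = 0.9$, and a return of $G_3 = 0$ after the final step. Working backwards with the definition above gives:

$$ G_2 = 1 + 0.9 \cdot 0 = 1, \qquad G_1 = 0 + 0.9 \cdot 1 = 0.9, \qquad G_0 = 0 + 0.9 \cdot 0.9 = 0.81 $$

The single reward at the end is passed back to the earlier time steps, shrinking by a factor $\gamma$ with every step.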

In [None]:
def discount_rewards(rewards, gamma):
    returns = np.zeros_like(rewards)
    ### START CODE ###
    # Initialize current return.
    # This is the expected return for the terminal state, i.e. G_T.
    current_return = ...
    ### END CODE ###
    
    # Go in reverse
    for t in reversed(range(len(rewards))):
        ### START CODE ###
        # Compute the return at time t, using
        # the "previous" return of (t+1) and the current reward.
        current_return = ...
        ### END CODE ###
        returns[t] = current_return
        
    return returns

Let's plot the effect of this discounting.

In [None]:
returns = discount_rewards(rewards, 0.9)

plt.figure(figsize=(15,5))
plt.plot(returns)

You should now clearly see that the spikes have longer tails to the left of them. This means that the expected return of an action taken at a time before a spike (i.e. a reward) will also be regarded as positive or negative. This effectively gives a reward to an action in proportion to how good the future will be. It is an estimate of the expected value of that action; an estimate of $q(s,a)$.

## Training step

Again, the final piece missing is the training step. This time we will fit the model to the data of an entire episode, instead of a random batch of experience.

We will first have to convert the rewards of the episode to the expected returns ($G_t$). Use the just implemented `discount_rewards` function.

As promised, we still have to explain how to get the returns into the training procedure. As explained in the lecture, the loss function we can use that implements the policy gradient method looks like:

$$ L(\theta) = G \mathop{\text{CrossEntropy}} ({\pi_\theta}(S), A) $$

The [`fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) function of the model, which we used before, has a nice parameter called `sample_weight`. It will use this to scale the loss function for all the samples in the batch. It will multiply the losses (cross-entropy in this case) with the corresponding scalar value in the `sample_weight` input. So if we use `returns` as sample weights, then we get exactly what we needed.
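The effect of the sample weights can be sketched in NumPy with made-up numbers. This only shows the per-sample products of return and cross-entropy; how Keras subsequently reduces them to a single loss value is not reproduced here.

```python
import numpy as np

# Hypothetical batch of 2 time steps with 3 actions each
action_probs = np.array([[0.7, 0.2, 0.1],
                         [0.3, 0.3, 0.4]])
actions = np.array([[1.0, 0.0, 0.0],      # one-hot: action 0 was taken
                    [0.0, 0.0, 1.0]])     # one-hot: action 2 was taken
returns = np.array([1.5, -0.5])           # made-up normalized returns

# Cross-entropy per sample, then scaled by the return of that sample
cross_entropy = -np.sum(actions * np.log(action_probs), axis=1)
weighted_loss = returns * cross_entropy
print(weighted_loss)
```

For a positive return, lowering the loss raises the probability of the action taken. For a negative return the sign flips, so lowering the loss reduces that probability.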

Let's try that.

In [None]:
def train(model, episode, gamma):
    # Unpack the episode
    observations, actions, rewards = episode.get()

    ### START CODE ###
    # Compute the expected returns using the discount function
    returns = ...
    ### END CODE ###
    
    # Normalize the returns to make training more efficient
    returns -= returns.mean()
    returns /= returns.std()

    ### START CODE ###
    # Fit the model to the observations and use actions as target to compute the cross-entropy loss.
    # Scale the losses for each sample using returns as sample_weight.
    # Run for only 1 epoch (epochs=1), and turn off logging with verbose=0.
    ...
    ### END CODE ###

## Training loop

Time for the actual training. We start with a number of hyperparameters that we can change if needed.

In [None]:
nr_episodes = 250
learning_rate = 3e-5
gamma = 0.99

Build and compile a fresh new model that is initialized randomly.

In [None]:
model = build_model()
compile_model(model, learning_rate)

And finally let's implement the loop itself. It is actually very short. We run an entire episode with the model, train on that episode, and repeat.

We will keep track of the individual scores of the episodes so we can nicely plot them when we are done.

**Training will again take about 15 minutes.** Take a break, and come back to see what happened.

In [None]:
scores = []

start_time = datetime.now()
for e in range(1, nr_episodes+1):
    ### START CODE ###
    # Run an episode
    episode = ...
    # Train on episode
    ...
    ### END CODE ###

    # Keep track of score
    scores.append(episode.score)
    # Show statistics
    delta = (datetime.now() - start_time).total_seconds()
    print(f'[{delta:.1f}] #{e}: length {episode.length}, score {episode.score:.0f}')

# Save trained model
model.save('pong.h5')

Again this took a long time. Take a look at the scores.

In [None]:
plt.figure(figsize=(15,5))
plt.plot(scores)

It barely improved! How can this be? There are multiple reasons, but the main reason is the REINFORCE algorithm we used. The lecture will explain it in more detail.

## Evaluate

Even though the training didn't look promising, we will still take a look at how well it performs now. Like before we will use an evaluation function to run an entire episode, display the frames, and return the total score. (Already implemented; simply run the cell.)

In [None]:
from utils import create_frame, update_frame

def evaluate(env, model):
    # Reset environment
    observation, done, score = env.reset(), False, 0

    # Setup display for first frame
    frame = create_frame(env)

    while not done:
        # Predict action probabilities
        action_probs = model.predict_on_batch(np.expand_dims(observation, axis=0))[0]
        # Select action with highest probability
        action = np.argmax(action_probs)
        # Perform action
        observation, reward, done, _ = env.step(action)
        score += reward

        # Update displayed frame
        update_frame(frame)
    return score

Now let's see how the initial, random model behaves, which is actually not that bad. It regularly hits the ball.

In [None]:
score = evaluate(env, build_model())
print(f'score {score:.1f}')

Now take a look at the model we trained.

In [None]:
from tensorflow.keras.models import load_model
score = evaluate(env, load_model('pong.h5'))
print(f'score {score:.1f}')

It doesn't do well. You can see it twitch, but it fails miserably. That is mainly caused by the lack of training. It uses the `argmax` function, but that's not the right choice yet. More training is needed.
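To illustrate why `argmax` is a poor fit for a barely trained policy, here is a minimal sketch with made-up, nearly uniform probabilities (not actual model output). The greedy choice always returns the same action, even though the policy itself only slightly prefers it.

```python
import numpy as np

np.random.seed(0)   # fixed seed so the run is repeatable
# Hypothetical, nearly uniform policy output for the 6 Pong actions
action_probs = np.array([0.16, 0.17, 0.18, 0.16, 0.17, 0.16])

# Greedy selection: always the action with the highest probability
greedy_action = np.argmax(action_probs)

# Sampling from the policy: the same action is picked only a fraction of the time
sampled_actions = np.random.choice(len(action_probs), size=1000, p=action_probs)
share_of_greedy = np.mean(sampled_actions == greedy_action)
print(greedy_action, share_of_greedy)
```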

Let's take a look how this neural network (i.e. model) can perform when trained with a proper algorithm (PPO) for a longer time.

In [None]:
score = evaluate(env, load_model('trained/pong.h5'))
print(f'score {score:.1f}')

It really owns the game. It learned the neat trick to hit the ball in such a way that the game's AI isn't fast enough to catch it, and it repeats this trick constantly.

## Finish

This marks the end of the exercises. You have now learned to implement all the basic reinforcement learning algorithms.

This last one, Policy Gradient, is actually quite simple to implement, but it is really hard to get it working right. There are some simple improvements that you can use to get it working a lot better, which will be discussed in the lectures.

We hope you enjoyed it. Have fun applying your newly acquired knowledge in other, more engaging environments!