# 4. Actor-Critic

Welcome to the last exercise in this series. This one works a little differently. In the previous chapter you implemented an agent with the DQN algorithm; in this chapter you will implement one with the Actor-Critic algorithm. Instead of building all the required methods step by step, we take the previous implementation and refactor it into an Actor-Critic model.

So, let's start again with importing some stuff.

In [None]:
import gym
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

And initialize some display libraries.

In [None]:
from utils import init_display
init_display()

We will be using the same Gym environment, `LunarLander-v2`.

In [None]:
env = gym.make('LunarLander-v2')
env._max_episode_steps = 500

## Actor neural network

The Actor part of this algorithm is a neural network that predicts the action probabilities of the policy. We use the already implemented `PolicyEstimator` class from [`estimator.py`](estimator.py). If you are interested and have time, take a look at the implementation.

Let's see how that class works.

In [None]:
from estimator import PolicyEstimator

actor = PolicyEstimator(env)

observation = env.reset()
actor.predict(observation)

It should return a list of four non-zero probabilities (values between 0.0 and 1.0) that sum to 1. This is the probability distribution of the policy for this state. Give it another state (i.e. observation), and it will return different values. You can see this if you re-run this cell: the output should be different.

## Select action

Instead of the $\epsilon$-greedy policy, we are now going to select an action randomly according to the distribution given by the actor. So, actions with a higher probability will be selected more often.

Below you see the **old** implementation of `select_action` of the previous exercise. You should refactor this to the new behavior.
- Remove the `epsilon` parameter.
- (optional) Rename parameter `estimator` to `actor`
- Remove epsilon greedy part.
- Predict action probabilities with `actor` instead of predicting action values
- Select a random action using the given probability distribution. See [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) for more information. The first parameter should be the number of actions to choose from, and the parameter `p` should be set to the action probabilities.
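
To get a feel for how `np.random.choice` behaves with the `p` parameter, here is a small stand-alone sketch. The probability vector is made up for illustration and is not real output of `PolicyEstimator`.

```python
import numpy as np

# Hypothetical action probabilities for a 4-action environment
# (illustrative values, not actual actor output).
action_probabilities = np.array([0.1, 0.2, 0.3, 0.4])

# Fixed seed so the demonstration is repeatable.
np.random.seed(0)

# Draw 10,000 action indices; `p` sets the probability of each index.
samples = np.random.choice(len(action_probabilities), size=10_000, p=action_probabilities)

# The empirical frequencies should be close to the given probabilities.
frequencies = np.bincount(samples, minlength=len(action_probabilities)) / len(samples)
print(frequencies)
```

Actions with a higher probability show up more often, but every action with a non-zero probability is still selected occasionally, which keeps the agent exploring.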

In [None]:
def select_action(env, estimator, observation, epsilon):
    ### START CODE ###
    
    # Implement epsilon greedy algorithm.
    # With probability `epsilon` take a random action,
    # otherwise take the action with the highest estimated value.
    if np.random.uniform() < epsilon:
        action = env.action_space.sample()
    else:
        action_values = estimator.predict(observation)
        action = np.argmax(action_values)
        
    ### END CODE ###
    return action

Let's see if it works. If you run the following cell, you should get a single integer between 0 and 3 as a result.

In [None]:
observation = env.reset()
action = select_action(env, actor, observation)
assert 0 <= action < 4
action

## Critic neural network

The Critic in this algorithm is a neural network that estimates state values (i.e. $v(s)$). We will use the already implemented `StateValueEstimator` class from [`estimator.py`](estimator.py).

Let's create one and see how it works.

In [None]:
from estimator import StateValueEstimator

critic = StateValueEstimator(env)
critic.load_weights('data/lunar-critic.h5')

observation = env.reset()
critic.predict(observation)

You should see that it returns only one value, which represents the expected return from this state, when following the learned policy. This will be trained later.

## Target values

The previous algorithm (DQN) only produced one set of values needed for training, namely estimations for $q(s,a)$. However, for this algorithm we need two sets.
- The advantages that are used as sample weights when training the actor.
- The target values for $v(s)$ used to train the critic.

Let's make two new functions to replace the `compute_target_values` function of the previous exercise.

### Advantages
Let's start with the advantages, and make a new function that computes the advantages as follows.

$$
A_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
$$

In other words, compute the value of the current transition (reward + expected return of next state) and subtract the estimate of expected return of this state.
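
The formula can be tried out on a few made-up numbers. The sketch below uses plain NumPy arrays instead of the real critic, and it assumes that the value of a terminal next state is masked out with `dones`; both the values and the masking are illustrative assumptions.

```python
import numpy as np

gamma = 0.99

# Hypothetical values for a three-step trajectory (not produced by the real critic).
rewards = np.array([1.0, 0.0, -1.0])
state_values = np.array([0.5, 0.4, 0.2])       # V(S_t)
next_state_values = np.array([0.4, 0.2, 0.7])  # V(S_{t+1})
dones = np.array([0.0, 0.0, 1.0])              # 1.0 marks the terminal transition

# Assumption: a terminal next state has no future return, so its value is masked out.
advantages = rewards + gamma * (1.0 - dones) * next_state_values - state_values
print(advantages)
```

The first transition earned more than the critic predicted (positive advantage), while the other two earned less (negative advantage).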

If we received more reward than predicted, then the advantage will be positive and the actor should predict taking this action in this state more often. And of course, the actor should predict taking this action less often when the advantage is negative.
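
One common way to use advantages as sample weights is to scale the negative log-probability of each taken action by its advantage. The sketch below shows this with made-up numbers in plain NumPy. Whether `PolicyEstimator` computes its loss in exactly this way is an assumption.

```python
import numpy as np

# Hypothetical predicted action probabilities for three states (illustrative values).
probabilities = np.array([[0.25, 0.25, 0.25, 0.25],
                          [0.10, 0.60, 0.20, 0.10],
                          [0.70, 0.10, 0.10, 0.10]])
actions = np.array([2, 1, 0])            # actions that were actually taken
advantages = np.array([1.0, -0.5, 0.0])  # made-up advantages for those actions

# Negative log-probability of each taken action (cross-entropy per sample).
neg_log_prob = -np.log(probabilities[np.arange(len(actions)), actions])

# Weight each sample by its advantage and average over the batch.
loss = np.mean(advantages * neg_log_prob)
print(loss)
```

Minimizing this loss raises the probability of actions with a positive advantage and lowers it for actions with a negative advantage. A sample with zero advantage contributes nothing.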

In [None]:
def compute_advantages(critic, observations, rewards, next_observations, dones, gamma):
    ### START CODE ###
    
    # Compute state values for both the current observations and the next observations
    # using the critic estimator.
    state_values = ...
    next_state_values = ...
    
    # Compute the advantage
    advantages = ...
    
    ### END CODE ###
    return advantages

### Discounted rewards

For the critic we are going to use the return ($G_t$) of each state visited in this episode. Remember, the return of state $S_t$ is defined as:

$$\begin{align}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\
    &= R_{t+1} + \gamma G_{t+1}
\end{align}$$

We can easily compute the $G_t$ for each state in the trajectory of this episode, by traversing the rewards backwards. We start at $t = T$ and go backward to $t = 0$. Remember, $G_T$ is 0, i.e. the terminal state has no expected return.
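
As a small worked example with made-up numbers, take $\gamma = 0.5$ and an episode with rewards $R_1 = 1$, $R_2 = 2$, $R_3 = 3$ (so $T = 3$). Working backwards from $G_3 = 0$:

$$\begin{align}
G_2 &= R_3 + \gamma G_3 = 3 + 0.5 \cdot 0 = 3 \\
G_1 &= R_2 + \gamma G_2 = 2 + 0.5 \cdot 3 = 3.5 \\
G_0 &= R_1 + \gamma G_1 = 1 + 0.5 \cdot 3.5 = 2.75
\end{align}$$

Expanding the sum directly gives the same result: $1 + 0.5 \cdot 2 + 0.25 \cdot 3 = 2.75$.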

In Python you can reverse a list using the built-in function `reversed(a)`, and you can insert an item at the start of a list with `a.insert(0, v)`.

In [None]:
def discount_rewards(rewards, gamma):
    discounted_rewards = []
    ### START CODE ###
 
    ...
    ...
    ...
    
    ### END CODE ###
    return np.array(discounted_rewards)

Let's see if it works by running this small unit test.

In [None]:
rewards = np.array([0, 1, 2, 3, 0, -3])
discount_rewards(rewards, 0.9)

The output should be:

    array([ 2.93553,  3.2617 ,  2.513  ,  0.57   , -2.7    , -3.     ])

If this is not the case (apart from floating-point rounding differences), then check your implementation.

### Training step

Now it is time to refactor the implementation of the `train` function from the previous exercise. This time the function will get the data of a complete trajectory (i.e. a whole episode), in order.

Below is the implementation of the previous exercise. You have to refactor it. The basic structure is still the same. First step is to compute the target values, then use that data to train the estimators.
- Rename parameter `estimator` to `actor`.
- Add parameter `critic` (after `actor`).
- Replace old `target_q` with:
    - Compute the advantages using the function implemented above.
    - Compute discounted rewards (i.e. returns) using the function implemented above.
- Train both `actor` and `critic`:
    - `actor` requires `observations`, `actions` and `advantages`.
    - `critic` requires `observations` and `returns`.

In [None]:
def train(estimator, observations, actions, rewards, next_observations, dones, gamma):
    ### START CODE ###

    # Compute target values
    target_q = compute_target_values(estimator, rewards, next_observations, dones, gamma)

    # Train the estimator to this data.
    # It needs both the `observations` and `actions` as input parameters to the neural network
    # and it needs `target_q` as target for the output of the neural network
    estimator.train_on_batch(observations, actions, target_q)
    
    ### END CODE ###

We'll test this implementation in the next section.

## Training

Now, let's update the training procedure. First let's set some parameters. We won't be needing the `epsilon` and `batch_size` anymore, so they are left out. The other parameters remain the same.

In [None]:
gamma = 0.99
max_time = 300 # seconds

Now, recreate the two estimators, so we can start fresh.

In [None]:
### START CODE ###
...
...
### END CODE ###

### Experience replay buffer

In the DQN algorithm we used a buffer to store the gathered experience. For the Actor-Critic algorithm we work episode-by-episode, so we will reuse the implementation of the experience buffer here to store the trajectory. For this we will need two extra functions:
- `experience_buffer.get_all()` to get all the data stored in the buffer in correct order.
- `experience_buffer.reset()` to clear the buffer, so a new episode can be stored.

First, create a buffer object with room for 1000 samples.

In [None]:
from utils import ExperienceBuffer
experience_buffer = ExperienceBuffer(1000, env.observation_space.shape, (1,))

### Train loop

Finally, it is time to update the training loop. Below is the implementation of the DQN training loop. You have to refactor this loop. The main thing that needs to change is the moment the training happens: instead of training at every step, we only train after each episode.

1. Change the call to `select_action` according to the new implementation.
2. Remove the train block from the inner loop.
3. Add a train block after the episode is finished.
    - Use `experience_buffer.get_all()` to get the trajectory of the episode.
    - Call your new `train` function to train on the trajectory.
    - Clear the buffer afterwards with the `experience_buffer.reset()` function.
4. Store the weights of both estimators.
    - Weights of actor should be stored in `lunar-actor.h5`.
    - Weights of critic should be stored in `lunar-critic.h5`.
    
Good luck, you are almost there.

In [None]:
start_time = datetime.now()

episode = 0
while (datetime.now() - start_time).total_seconds() < max_time:
    episode_length, episode_score = 0, 0

    observation, done = env.reset(), False
    while not done:
        ### START CODE ###
        
        # Select an action using the epsilon-greedy policy
        action = select_action(env, estimator, observation, epsilon)
        
        # Take action and get reward
        next_observation, reward, done, _ = env.step(action)

        # Add to experience buffer
        experience_buffer.append(observation, action, reward, next_observation, done)
        
        ### END CODE ###

        if experience_buffer.is_filled(batch_size):
            ### START CODE ###
            
            # Select a random batch from experience
            observations, actions, rewards, next_observations, dones = experience_buffer.sample(batch_size)

            # Perform one train step on the batch
            train(estimator, observations, actions, rewards, next_observations, dones, gamma)
            
            ### END CODE ###

        # Prepare for next iteration
        observation = next_observation
        episode_length += 1
        episode_score += reward
    
    # Show statistics
    t = (datetime.now() - start_time).total_seconds()
    print(f'[{t:.1f}] #{episode}: length {episode_length}, score {episode_score:.0f}')

    episode += 1

# Save the end result
estimator.save_weights('lunar.h5')

Again, the average score should have increased, but it could be noisier this time.

Time to evaluate the result.

## Evaluation

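Evaluating the result is very similar to before. In the DQN exercise the evaluation loop selected the action with the highest value. With Actor-Critic we have an actor that returns action probabilities, so now we only need to select the action with the highest probability. The refactoring of this code is therefore mainly cosmetic, and the renames are optional. The call to `select_action` does need to match its new signature, which no longer takes `epsilon`. Keep in mind that the refactored `select_action` samples from the distribution; to always take the most probable action, apply `np.argmax` to the predicted probabilities instead.

- Rename `estimator` to `actor`.
- Rename `action_values` to `action_probabilities`.
- Update the call to `select_action` (drop the `0` passed as `epsilon`).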
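
The difference between picking the most probable action and sampling from the distribution can be seen in a small stand-alone sketch. The probabilities are made up for illustration and are not real actor output.

```python
import numpy as np

# Hypothetical action probabilities (illustrative, not real actor output).
action_probabilities = np.array([0.05, 0.15, 0.7, 0.1])

# Greedy selection: always the most probable action.
greedy_action = int(np.argmax(action_probabilities))

# Stochastic selection: the same state can yield different actions.
np.random.seed(0)
sampled_actions = np.random.choice(len(action_probabilities), size=20, p=action_probabilities)

print(greedy_action)
print(sampled_actions)
```

Greedy selection is deterministic, which makes evaluation scores easier to compare. Sampling keeps some randomness in the behaviour of the agent.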

In [None]:
from utils import create_frame, update_frame

def evaluate(estimator, env):
    ### START CODE ###

    # Reset environment
    observation = env.reset()
    
    ### END CODE ###

    done, score = False, 0
    frame = create_frame(env)

    while not done:
        ### START CODE ###
        
        # Greedy select an action based on the estimated action values
        action = select_action(env, estimator, observation, 0)
        
        # Perform action
        observation, reward, done, _ = env.step(action)
        
        ### END CODE ###
        score += reward
        update_frame(frame)

    return score

The advantage of the Actor-Critic method is that during training we need the Critic, but once trained we only need the Actor.

Now, we can first evaluate the initial (random) policy. We only need an actor here. So let's create a new `PolicyEstimator` to start with a fresh, random neural network and evaluate that.

In [None]:
actor = PolicyEstimator(env)
score = evaluate(actor, env)
print(f'initial model: {score:.1f}')

Then we can compare it with the actor that was trained by you.

In [None]:
actor.load_weights('lunar-actor.h5')
score = evaluate(actor, env)
print(f'partially trained model: {score:.1f}')

It performs nowhere near as well as it did during the DQN exercise. That is mainly caused by the different way of training: we now need more samples (i.e. episodes) to get the critic trained before it can steer the actor in the right direction. In the lecture you will see what can be done to improve this.

Now finally, compare it with an actor that has been trained with an improved algorithm (PPO).

In [None]:
actor.load_weights('trained/lunar-ppo-actor.h5')
score = evaluate(actor, env)
print(f'fully trained model: {score:.1f}')

## Conclusion

That's it! You are done. You just transformed a DQN algorithm into an Actor-Critic algorithm. This model does not perform very well ... yet. It is, however, the starting point of much more efficient algorithms that will be discussed in the lectures.

Well done!