# 3. Deep Q Network

Welcome to the third exercise in this series. In the previous exercises we worked with relatively simple environments that could be solved quickly because the number of states was limited. In this exercise we will create an agent that learns a policy for an environment with a very large number of states. We will still be estimating the action values, but instead of storing them in a table we will approximate them with a neural network.

This exercise is going to be big compared to the others, so buckle up!

Let's first include a number of necessities.

In [None]:
import gym
import numpy as np
from datetime import datetime
from utils import ExperienceBuffer
import matplotlib.pyplot as plt

Later in this notebook we want to display a few screens to you, but when you run this on a server there is no display available, so we need to start a virtual display.

In [None]:
try:
    from pyvirtualdisplay import Display
    disp = Display(visible=False, size=(400, 300)).start()
except Exception:
    print('Virtual display not used')

## Lunar lander

We will be using another Gym environment: this time it is the `LunarLander` environment.

In [None]:
env = gym.make('LunarLander-v2')
env._max_episode_steps = 500

print(f'Observation: {env.observation_space}, {env.observation_space.dtype}')
print(f'Action:      {env.action_space}, {env.action_space.dtype}')

As you can see, the observation/state of this environment is a `Box`, which basically means a multi-dimensional array. This time it has one dimension (i.e. it is a vector) of 8 floating point values instead of a single integer. This means that the number of possible combinations of values is practically infinite, so storing an estimate for each of them in a table is not feasible. We will have to use function approximation.
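To get a rough feel for why a table is out of reach, the sketch below counts the entries a Q-table would need if each of the 8 observation values were discretised into a fixed number of bins. The bin counts and the helper name `q_table_entries` are assumptions chosen purely for illustration; they are not part of the environment.

```python
# Rough size of a Q-table for a discretised 8-dimensional observation.
# The number of bins per dimension is an assumption for illustration only.
n_dimensions = 8
n_actions = 4

def q_table_entries(bins_per_dimension):
    # One table row per combination of bins, one column per action
    return bins_per_dimension ** n_dimensions * n_actions

for bins in (10, 100):
    print(f'{bins} bins per dimension -> {q_table_entries(bins):,} entries')
```

Even a coarse grid of 10 bins per value already needs hundreds of millions of entries, most of which the agent would never visit.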

The action space is (like before) an integer with 4 discrete values.

**Note**: the change in maximum episode steps is done to improve training speed in this exercise. With the default of 1000 steps the training time would be significantly longer and we don't have time for that. It won't affect the end result too much.

We can render the current observation to an image to get a better feel for what the environment looks like.

In [None]:
observation = env.reset()
print(observation)
plt.figure(figsize=(8,6))
plt.imshow(env.render('rgb_array'))

You can see there is a moon lander on the top of the screen and a landing zone marked with two flags.

The goal of our agent is to land the lander on the target zone. The lander has 3 thrusters, one vertical (downwards) and two horizontal to the left and right. The agent has 4 actions, one for firing each thruster and one action to do nothing. So at most one thruster is active at any time.

If it lands too fast, it crashes and you'll get a negative reward. If it lands safely you get a positive reward. For all the fuel you spend you will get a penalty as well. But we don't have a model, so we pretend we don't know this ;).

The observation you get is a list of distances and velocities compared to the ground. It is not really important which value is what, because our agent will figure that out by itself.

## TensorFlow

<img src="figures/dqn architecture.png" align="right" width="33%" /> We want to implement a function that can approximate the action values for each observation, so we can implement a greedy policy. We will implement this with a neural network and we will use [TensorFlow](https://www.tensorflow.org/) (version 2.3) for this.

TensorFlow is an open-source platform for machine learning and is, together with PyTorch, among the most popular. It has a nice layered API that allows you to work at any level, from bare computational graphs up to pretrained models. We will be using the higher level [Keras Functional API](https://www.tensorflow.org/guide/keras/functional), which is the recommended way of working with TensorFlow (since 2.0).

The network architecture that we are going to use is as depicted on the right. It will consist of an input layer with the size of the observation (8), then 3 hidden layers with decreasing size, and finally an output layer that gives the values for each action (4). In TensorFlow these will be combined into a model, which can be used for inference (i.e. predict values) and training. Let's start by making the model architecture first and worry about training it later.

First include the necessary TensorFlow components.

In [None]:
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
# Disable eager execution for performance
tf.compat.v1.disable_eager_execution()

Note: [Eager Execution](https://www.tensorflow.org/guide/eager) is a handy feature of TensorFlow during development, but it also reduces the performance significantly, so we turn it off.

### Prediction model

Now we will build the model needed for predicting/approximating action values.

The Keras Functional API works by creating an instance of a layer class and then calling (i.e. invoking) that instance with other layers as parameters. The functional API means that the instances are callable functions. Finally, the input layer and the output layer of your neural network have to be passed to the constructor of the `Model` class in order to create a TensorFlow model that can be used for inference and training.

In pseudo-code this looks like:
    
    i = Input(...)
    h = Dense(...)(i)
    o = Dense(...)(h)
    m = Model(i, o)

In this exercise you will only need to use the layer classes [`Input`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Input) and [`Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense), and combine them in a model with the [`Model`](https://www.tensorflow.org/api_docs/python/tf/keras/models/Model) class. Be sure to set the parameter `name` for all the layers. This will make it easier to see the structure later on.

    Dense(..., name='hidden2')

Furthermore, all hidden layers should use `activation` parameter `'tanh'` and the output layer should use `activation` parameter `'linear'` because we want to predict a value with no range limitation.
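To make the role of the two activation functions concrete, here is a minimal NumPy sketch of what a dense layer computes, namely `activation(x @ W + b)`. It is not how TensorFlow implements the layer; the helper `dense`, the hidden size of 16 and the random weights are assumptions for illustration only.

```python
import numpy as np

def dense(x, weights, bias, activation):
    # A dense layer computes activation(x @ W + b) for a whole batch at once
    z = x @ weights + bias
    return np.tanh(z) if activation == 'tanh' else z

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))                       # a batch with one observation
w_hidden, b_hidden = rng.normal(size=(8, 16)), np.zeros(16)
w_out, b_out = rng.normal(size=(16, 4)), np.zeros(4)

hidden = dense(x, w_hidden, b_hidden, 'tanh')     # values between -1 and 1
output = dense(hidden, w_out, b_out, 'linear')    # values without range limitation
print(hidden.shape, output.shape)
```

The `'tanh'` layer keeps its outputs bounded, while the `'linear'` output layer can produce any value, which is what we need for action values.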

Let's code.

In [None]:
def build_prediction_model():
    ### START CODE ###
    
    # Create an Input layer with shape (8,), i.e. the shape of the observation
    # It doesn't require any other layers as input
    observation = ...
    
    # Create three Dense layers with the appropriate sizes.
    # All should use the activation function 'tanh'.
    # Be extra careful to pass the right input objects to the layers.
    dense1 = ...
    dense2 = ...
    dense3 = ...
    
    # Create one final Dense layer that will be the output layer.
    # Its size should be the number of actions.
    # Activation function is 'linear'
    action_values = ...
    
    # Finally create a model with the input layer for the 'inputs' parameter
    # and the last layer for the 'outputs' parameter
    model = ...
    
    ### END CODE ###
    return model

Now, let's see if you implemented it correctly. If you run the next cell it will create a model and show a summary of the structure. The `name` parameters you used will be visible here, making it easier to check if everything is correct.

In [None]:
model_test = build_prediction_model()
model_test.summary()

The output should look similar to:

    Model: "lunar-dqn"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    observation (InputLayer)     [(None, 8)]               0         
    _________________________________________________________________
    dense1 (Dense)               (None, 256)               2304      
    _________________________________________________________________
    dense2 (Dense)               (None, 128)               32896     
    _________________________________________________________________
    dense3 (Dense)               (None, 64)                8256      
    _________________________________________________________________
    action_values (Dense)        (None, 4)                 260       
    =================================================================
    Total params: 43,716
    Trainable params: 43,716
    Non-trainable params: 0
    _________________________________________________________________

Compare it carefully: the total number of parameters should match exactly!
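If you want to check where these numbers come from, the following sketch recomputes them. A dense layer has one weight for every input/output pair plus one bias per output; the helper name `dense_params` is made up for this example.

```python
# Parameters of a dense layer: one weight per input/output pair plus one bias per output
def dense_params(n_inputs, n_outputs):
    return n_inputs * n_outputs + n_outputs

layer_sizes = [8, 256, 128, 64, 4]   # taken from the summary above
counts = [dense_params(i, o) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]
print(counts, sum(counts))
```

This prints `[2304, 32896, 8256, 260] 43716`, matching the `Param #` column and the total of the summary.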

### Predicting
With this model we can do predictions using the [`predict_on_batch`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict_on_batch) function. The model is initialized randomly so it won't predict anything useful, but at least we can see it work.

TensorFlow models are designed to be used with batches of data. So if you want to use it with a single element (i.e. one observation), you have to change the shape of the input data. In this case we have an observation of shape `(8,)` but the model requires something of the shape `(?, 8)`. This can be done simply by using the function [`np.expand_dims`](https://numpy.org/doc/stable/reference/generated/numpy.expand_dims.html), with `axis=0`. This will transform our observation from `(8,)` to `(1,8)`.
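As a quick illustration of the shape change, the sketch below uses an array of zeros as a stand-in for a real observation (an assumption, so that it runs without the environment).

```python
import numpy as np

observation = np.zeros(8, dtype=np.float32)   # stand-in for a real observation
batch = np.expand_dims(observation, axis=0)   # add a batch dimension in front
print(observation.shape, batch.shape)

# Indexing with [0] takes the single element out of the batch again
print(batch[0].shape)
```

The same indexing trick is useful on the output side: a prediction for a batch of one has shape `(1, 4)`, and `[0]` gives you the 4 action values.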

**Note!** TensorFlow development is still in the middle of transferring all APIs from the v1 core to the v2 core. So you can get some warnings about deprecated functions. You can safely ignore them.

In [None]:
### START CODE ###

# Reset the environment, to get the initial observation
observation = ...
print(f'observation shape was {observation.shape}')

# Reshape the observation to a single batch
observation = ...
print(f'observation shape is now {observation.shape}')

# Call the function `predict_on_batch` with the observation as input to get a prediction for the action values
action_values = ...

### END CODE ###
action_values

You will now see an array with 4 action values. They don't make any sense yet, because we haven't trained the model.

### One-hot vectors

As explained during the lecture, and similar to the previous exercise, we want to update the approximations for every interaction the agent makes with the environment. This means that for every item in our training set we will have an action value for only one action, not for all 4. So far, our model outputs 4 values, and will need 4 values as target values when training.

So to be able to train our model, we will have to extend our model to only output the value of a selected action.

There is no built-in layer in TensorFlow that can do this, but we can simply implement it ourselves with a [`Lambda`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda) layer. This is a layer which accepts a function (e.g. a python `lambda` function) that is executed for every input the layer gets.

A simple way to get the action value for a single action is by multiplying the output of the prediction model with the so-called one-hot vector of the action selected. A one-hot vector is a vector with all zeroes except for one item which is set to one. This can be used to convert integers to vectors and vice-versa. The helper function [`to_categorical`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical) can do this for you. This multiplication will result in a vector with all zeroes, except for the action you selected. If you take the sum of this row, then you get only the value for the selected action.

Let's take a look at the following example.

In [None]:
action = 2
action_onehot = to_categorical(action, num_classes=4)
print(f'action_onehot = {action_onehot}')
print(f'action_values = {action_values}')

# Multiply the action values with the one hot vector to mask out all other values.
action_value = action_values * action_onehot
print(f'action_value = {action_value}')

# Sum the columns (axis=1) to get only the selected action value.
# Note: the keepdims parameter will make sure the dimensions don't change from (N,4) to (N,), but stay (N,1),
# which is needed for the next layers in the model.
action_value = np.sum(action_value, axis=1, keepdims=True)
print(f'action value for selected action is {action_value}')

### Training model
We will use this masking trick in our `Lambda` layer. Be aware that the `Lambda` layer should use TensorFlow functions instead of numpy functions, otherwise the computation graph that TensorFlow builds internally will not be correct. Luckily, TF includes alternatives to a lot of numpy functions in its `tf.keras.backend` package, which we imported under the name `K`. So, the `np.sum` function is available to you as [`K.sum`](https://www.tensorflow.org/api_docs/python/tf/keras/backend/sum) and behaves the same as in the example above.

So next to the observation our training model will also need the action for which we want to train the value. This is a new `Input` layer.

The nice thing about the Keras Functional API is that you can use another model inside your model, so we can simply call the previous model with our observation input and get all the action values as predicted by that model. TensorFlow will automatically include this sub-model in its internal computation graph, so the weights of this included sub-model are also updated during training.

In [None]:
def build_training_model(model_predict):
    ### START CODE ###
    
    # Create two input layers.
    # one for the observation (with the same shape as before) and 
    observation = ...
    # one for the one-hot action (i.e. shape (4,))
    action = ...

    # Call the prediction model object as if it is a layer instance with the observation as input to get all action values
    action_values = ...
    
    # Implement a Lambda layer similar to the numpy example above.
    action_value = Lambda(lambda x: ..., name='action_value')([action_values, action])
    
    # Create a new Model instance with BOTH the observation and action as input.
    # Use an array ([]) for this.
    # Use the output of the lambda layer as output to this model.
    model_train = ...
    
    ### END CODE ###
    return model_train

### Compilation

Finally, all we have to do now is a so-called compilation of the model. This will prepare the TF internals for training. This can simply be done with the `compile` function of the [`Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile) class.

This function requires two parameters: the loss function and an optimizer. We have one scalar value as output and we will get one scalar value as target, so we will use the "mean squared error" as loss function, abbreviated to simply `'mse'`. We will use the [`Adam`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) optimizer, which is a de-facto standard optimizer for deep learning. It basically extends the *gradient descent* method with momentum and a per-weight adaptive step size, so the updates don't change too abruptly, which makes training smoother. Except for the learning rate we can use the default parameters.
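For intuition only, here is a NumPy sketch of the mean squared error and of a single Adam update as described in the published algorithm. It is a simplified illustration, not TensorFlow's implementation; the function names `mse` and `adam_step` and the toy problem are assumptions.

```python
import numpy as np

def mse(prediction, target):
    # Mean of the squared differences between prediction and target
    return np.mean((prediction - target) ** 2)

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the first steps
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy problem: fit a single weight so that w * x matches the targets
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x
w, m, v = 0.0, 0.0, 0.0
loss_before = mse(w * x, y)
for t in range(1, 201):
    grad = np.mean(2 * (w * x - y) * x)         # derivative of the MSE with respect to w
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
loss_after = mse(w * x, y)
print(f'loss before {loss_before:.3f}, loss after {loss_after:.5f}')
```

In the exercise you don't have to implement any of this yourself: Keras does it for you once the model is compiled with the loss and the optimizer.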

In [None]:
def compile_training_model(model_train, learning_rate):
    ### START CODE ###
    # Compile the model by calling the `compile` function.
    # Use an instance of the Adam class with the given learning rate as `optimizer`.
    # Use the mean squared error `loss` function.
    ...
    
    ### END CODE ###

### Test

Now let's see if you implemented all functions correctly.

In [None]:
model_predict = build_prediction_model()
model_train = build_training_model(model_predict)
compile_training_model(model_train, 1e-3)
model_train.summary()

The output should be similar to:

    Model: "lunar-dqn-train"
    __________________________________________________________________________________________________
    Layer (type)                    Output Shape         Param #     Connected to                     
    ==================================================================================================
    observation (InputLayer)        [(None, 8)]          0                                            
    __________________________________________________________________________________________________
    lunar-dqn (Functional)          (None, 4)            43716       observation[0][0]                
    __________________________________________________________________________________________________
    action (InputLayer)             [(None, 4)]          0                                            
    __________________________________________________________________________________________________
    action_value (Lambda)           (None, 1)            0           lunar-dqn[0][0]                  
                                                                     action[0][0]                     
    ==================================================================================================
    Total params: 43,716
    Trainable params: 43,716
    Non-trainable params: 0
    __________________________________________________________________________________________________

You can see that the prediction model is used as a layer and has all its parameters included in the list of trainable parameters. The training model itself doesn't add any learnable layers, so it doesn't add any trainable parameters.

You can also see that the output of this model is no longer a 4 column vector, but only a single value, the action value of the selected action.

Let's see if it works, by predicting a single example. (Simply run this cell.)

In [None]:
# Reset the environment and make a batch of the initial observation
observation = env.reset()
observation = np.expand_dims(observation, axis=0)

# Predict the action values with the prediction model
action_values = model_predict.predict_on_batch(observation)
print(f'action values predicted by prediction model = {action_values}')

# Select action 2 and make it a one-hot vector (batch)
action = to_categorical([2], num_classes=4)
print(f'one hot {action}')

# Predict the single action value with the training model
action_value = model_train.predict_on_batch([observation, action])
print(f'value of action 2 predicted by training model = {action_value}')

You should note that the value predicted for action `2` is exactly the same for both models. This shows that the weights of the prediction model are used by the training model.

## $\epsilon$-greedy policy

Like in the previous exercise we will need an $\epsilon$-greedy policy to explore during training. This time the policy will be based on the output of the prediction model (`model_predict`) given an observation. Implement the epsilon greedy policy using [`np.random.uniform`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html), `env.action_space.sample()`, [`predict_on_batch`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict_on_batch), and [`np.argmax`](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html).
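As a reminder of how the selection rule itself works, here is a small sketch on a fixed array of made-up action values. The helper `epsilon_greedy_from_values` and the NumPy random generator are assumptions for this illustration; in the exercise the values come from `model_predict` and the random action from `env.action_space.sample()`.

```python
import numpy as np

def epsilon_greedy_from_values(q_values, epsilon, rng):
    # Explore: with probability epsilon pick any action uniformly at random
    if rng.uniform() < epsilon:
        return int(rng.integers(len(q_values)))
    # Exploit: otherwise pick the action with the highest estimated value
    return int(np.argmax(q_values))

rng = np.random.default_rng(42)
q_values = np.array([0.1, -0.4, 0.9, 0.2])   # made-up action values
actions = [epsilon_greedy_from_values(q_values, 0.3, rng) for _ in range(1000)]

# Fraction of times each action was selected
print(np.bincount(actions, minlength=4) / 1000)
```

Action `2` has the highest value, so it is selected most of the time, while the other actions are still tried occasionally.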

In [None]:
def epsilon_greedy_policy(env, model_predict, observation, epsilon):
    ### START CODE ###
    
    # Implement epsilon greedy algorithm.
    # With probability `epsilon` take a random action,
    # otherwise take the action with the highest predicted action value.
    ...
        
    ### END CODE ###
    return action

## Training data

Training of the model will be done on batches of experience (i.e. actor-environment interactions). These batches contain:

- `observation`: The (current) observation, i.e. $S_t$
- `action`: The action taken, i.e. $A_t$
- `reward`: The reward returned by the environment after performing the action, i.e. $R_{t+1}$
- `next_observation`: The next observation after the action is performed, i.e. $S_{t+1}$
- `done`: Whether or not the episode is finished. In other words, whether the `next_observation` belongs to a terminal state.

From such a batch we have to compute the target values for the neural network. This means we have to compute a target for $Q(S_t,A_t)$. Like before, we will be using the TD(0) algorithm, so our target will be bootstrapped with the prediction for the next observation.

$$ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) $$

Implement that in the following function. You have to use [`predict_on_batch`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict_on_batch) to get the action values for the next observation. Then use [`np.amax`](https://numpy.org/doc/stable/reference/generated/numpy.amax.html) to get the value of the action with the highest value.

The value of terminal states is always 0 (remember $G_T = 0$), so $Q(S_T, a)$ must be 0 as well. You can use the `dones` array to correctly filter out those values. `dones` is an array of booleans, but similar to the C programming language you can use them as integers. A `True` value is processed as a `1` and a `False` value is processed as `0`. So, if you multiply the maximum action value (`max_next_q`) with `(1-dones)` you will end up with a valid maximum action value in samples that do not end in a terminal state, otherwise the result is 0. 
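The masking with `(1-dones)` can be seen in the following worked example on made-up numbers. The array shapes (`(N, 1)` for rewards and dones, `(N, 4)` for the action values) are an assumption of this sketch, and the action values are written by hand instead of coming from the model.

```python
import numpy as np

gamma = 0.99
rewards = np.array([[1.0], [-100.0]])                 # shape (2, 1)
dones = np.array([[False], [True]])                   # second sample ends the episode
next_q = np.array([[0.5, 2.0, -1.0, 0.0],
                   [3.0, 1.0, 4.0, 2.0]])             # made-up predictions, shape (2, 4)

# Highest action value per row, keeping the shape (2, 1)
max_next_q = np.amax(next_q, axis=1, keepdims=True)

# Bootstrapped target; (1 - dones) zeroes the bootstrap term for terminal samples
target_q = rewards + gamma * max_next_q * (1 - dones)
print(target_q)
```

The first target is `1 + 0.99 * 2 = 2.98`. The second sample is terminal, so its target is just the reward of `-100`, even though the predicted values for that next observation are not zero.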

In [None]:
def compute_target_values(model_predict, rewards, next_observations, dones, gamma):
    ### START CODE ###
    
    # Estimate the action values for the next observation using the prediction model
    next_q = ...
    
    # Find the maximum action value (use `axis` 1)
    # Make sure to set `keepdims=True` so the number of dimensions of the array doesn't change.
    max_next_q = ...
    
    # Compute the target action value as described above.
    # Take care of the terminal state
    target_q = ...
    
    ### END CODE ###
    return target_q

Time to test this. Run the following cell and see if your implementation is correct. A set of files is used to make sure the situation is always the same.

In [None]:
model_predict = build_prediction_model()
model_predict.load_weights('data/lunar_test_model.h5')

r = np.load('data/lunar_test_rewards.npy')
o = np.load('data/lunar_test_next_observations.npy')
d = np.load('data/lunar_test_dones.npy')
compute_target_values(model_predict, r, o, d, 0.99)

The output of this test should be exactly equal to:

    array([[  61.207016],
           [ -26.601854],
           [-100.      ]], dtype=float32)

Note that its shape is `(3,1)`, i.e. a single value for each given sample.
If your output is not the same, then check your implementation! Make sure the shapes of the intermediate steps are correct. Check this using for instance `print(max_next_q.shape)`.

## Training step

We now have nearly all the pieces in place to start training. The last missing piece is the training step itself.

TensorFlow models have a very simple way of training. You can call the function [`fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) with the inputs for the model and a target output, and it will incrementally change the weights of the model so the output fits the given target values. It performs the following steps automatically:
- split the input data into smaller mini-batches,
- compute the output of the network for the given input data (i.e. predict),
- compute the loss of the output using the given target values,
- compute the gradients of the loss with respect to the network weights,
- update the weights of the network using the selected optimizer.

Running these steps once over all mini-batches of the given data is called one epoch, and the `fit` function can run a number of epochs. For our RL problem we only want to run one epoch and then gather new data with the updated model.
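To show what happens during one epoch, the sketch below performs the listed steps by hand in NumPy. It uses a linear model and plain gradient descent as stand-ins for the neural network and the optimizer, on made-up data; all names and sizes are assumptions for illustration, and this is not what `fit` does internally line by line.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))                 # 256 made-up input rows
true_w = rng.normal(size=(8, 1))
y = x @ true_w                                # targets from a known linear function

w = np.zeros((8, 1))                          # weights of a stand-in linear model
learning_rate, mini_batch_size = 0.05, 32

def mse(prediction, target):
    return float(np.mean((prediction - target) ** 2))

loss_before = mse(x @ w, y)
order = rng.permutation(len(x))               # shuffle before splitting into mini-batches
for start in range(0, len(x), mini_batch_size):
    idx = order[start:start + mini_batch_size]
    prediction = x[idx] @ w                       # predict
    error = prediction - y[idx]                   # difference used by the MSE loss
    gradient = 2 * x[idx].T @ error / len(idx)    # gradient of the loss w.r.t. the weights
    w -= learning_rate * gradient                 # plain gradient descent update
loss_after = mse(x @ w, y)
print(f'loss before {loss_before:.3f}, after one epoch {loss_after:.3f}')
```

With 256 samples and mini-batches of 32, one epoch consists of 8 weight updates.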

Implement the next function. It receives both models as parameters: the predict model is needed to compute the target values (using `compute_target_values`) and the train model is used to actually train. Furthermore, it gets a batch of data (`observations`, `actions`, `rewards`, `next_observations`, `dones`) and the hyperparameter `gamma` ($\gamma$).

In [None]:
def train(model_predict, model_train, observations, actions, rewards, next_observations, dones, gamma):
    ### START CODE ###

    # Compute target values
    target_q = ...

    # Fit the train model to this data.
    # It needs both the `observations` and `actions` as parameter 'x' (use an array [])
    # and it needs `target_q` as parameter 'y'.
    # Run for only 1 epoch, and turn off output with verbose=0.
    ...
    
    ### END CODE ###

## Training loop

Finally, we can implement the training loop. Let's define a number of hyperparameters for our model.

In [None]:
learning_rate = 3e-5
epsilon = 0.3
gamma = 0.99
batch_size = 256
buffer_size = 20000
nr_episodes = 150

Now let's build the prediction and training models, so we can start fresh.

In [None]:
model_predict = build_prediction_model()
model_train = build_training_model(model_predict)
compile_training_model(model_train, learning_rate)

Training is done while performing actions, but the training data itself consists of random samples from earlier gathered experience. We will use an experience buffer for that, which is already implemented in the [utils.py](utils.py) file. If you want, you can take a look at how it works. It is basically a big queue of data with a maximum capacity. When new data is added to a full buffer, the oldest data is removed/overwritten.
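The idea of such a bounded queue can be sketched in a few lines of plain Python. The class `SimpleReplayBuffer` below is a hypothetical stand-in written for this explanation; it is not the `ExperienceBuffer` from `utils.py` and its interface differs.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    # Hypothetical stand-in, not the ExperienceBuffer from utils.py
    def __init__(self, capacity):
        # A deque with maxlen drops its oldest item when appending to a full buffer
        self.items = deque(maxlen=capacity)

    def append(self, experience):
        self.items.append(experience)

    def sample(self, batch_size):
        # Uniform random sample without replacement
        return random.sample(list(self.items), batch_size)

buffer = SimpleReplayBuffer(capacity=3)
for step in range(5):
    buffer.append(step)          # integers stand in for experience tuples
print(list(buffer.items))
print(buffer.sample(2))
```

After five appends only the three most recent items are left. Sampling at random from the buffer breaks up the strong correlation between consecutive steps of an episode.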

In [None]:
experience_buffer = ExperienceBuffer(buffer_size, (8,), (4,))

The main training loop basically consists of three steps.
1. Perform an action by using the $\epsilon$-greedy policy.
2. Store experience in a buffer.
3. Take a random batch from the experience buffer and train on it.

When an episode is finished we start a new episode and continue with these steps.
Fill in the required steps below and run it. It will show you the progress it makes per episode. You should see the scores go up on average.

**Running this training loop can take up to 15 minutes.** Take a coffee/tea break and come back later to see the results.

In [None]:
start_time = datetime.now()

for episode in range(nr_episodes):
    episode_length, episode_score = 0, 0

    observation, done = env.reset(), False
    while not done:
        ### START CODE ###
        # Select an action using the epsilon-greedy policy
        action = ...
        
        # Take action and get reward
        next_observation, reward, done, _ = ...
        ### END CODE ###

        # Convert action to one-hot vector
        selected_action = to_categorical(action, num_classes=4)

        # Add to experience buffer
        experience_buffer.append(observation, selected_action, reward, next_observation, done)
        if experience_buffer.is_filled(batch_size):
            # Select a random batch from experience
            observations, actions, rewards, next_observations, dones = experience_buffer.sample(batch_size)

            ### START CODE ###
            # Perform one train step on the batch
            ...
            ### END CODE ###

        # Prepare for next iteration
        observation = next_observation
        episode_length += 1
        episode_score += reward

    # Show statistics
    t = (datetime.now() - start_time).total_seconds()
    print(f'[{t:.1f}] #{episode}: length {episode_length}, score {episode_score:.0f}')

# Save the end result
model_predict.save('lunar.h5')

t = (datetime.now() - start_time).total_seconds()
print(f'Finished in {t:.1f} seconds')

Phew, it finally finished training. By now the model should have been trained for 150 episodes and the score should on average have been increasing. Time to evaluate.

If it didn't increase at all, then you are one of the unlucky few whose model converged to a local (sub-)optimum. This sometimes happens in deep reinforcement learning, and in particular in unstable environments like this. There are ways to prevent this, like choosing less aggressive learning rates and bigger batch sizes, but those will increase training time a lot. For now, the only thing you can do is to restart the entire notebook. If you have the time (maybe while listening to the next part of the lectures) you can click on `Kernel`->`Restart & Run All` and try again. But first read on to the end.

## Evaluation

It took quite some time to train and that is only for a relatively small number of episodes. Let's see how well it performs anyway. We will use the following helper function to evaluate a model by taking greedy actions only (no more exploring, only exploiting) and reporting the final score. In the meantime we'll render the environment to show you what happens.

In [None]:
from utils import create_frame, update_frame

def evaluate(model, env):
    # Reset environment
    obs, done, score = env.reset(), False, 0
    frame = create_frame(env)

    while not done:
        # Predict action values
        q = model.predict_on_batch(np.expand_dims(obs, axis=0))[0]
        # Select action with highest value
        action = np.argmax(q)

        # Perform action
        obs, reward, done, _ = env.step(action)
        score += reward

        update_frame(frame)

    return score

Let's first see how the randomly initialized model performs. It should be really bad.

In [None]:
score = evaluate(build_prediction_model(), env)
print(f'initial model: {score:.1f}')

Now let's see how well your trained model performs. We'll load the model as saved at the end of your training loop.

In [None]:
from tensorflow.keras.models import load_model
score = evaluate(load_model('lunar.h5'), env)
print(f'partially trained model: {score:.1f}')

It should perform a lot better. It should no longer instantly crash and burn. But it performs nowhere near optimal. For that we need to train it a lot longer. Here is the performance of a model that was trained for 500 episodes with a learning rate of 3e-5 and then another 500 episodes with learning rate 1e-5. This took about 80 minutes.

In [None]:
score = evaluate(load_model('trained/lunar.h5'), env)
print(f'fully trained model: {score:.1f}')

## Conclusion

Phew, that's it. We have trained a model-free agent. Purely by interacting with the environment it learned what actions are best for each observation it receives. It takes quite some time to train, and it didn't even get that far. It requires a lot more time and experience to train to a decent score. But the size of the observation is also a lot bigger than before. There are a number of things you can do to improve training time and sample efficiency (how much it learns from a single interaction with the environment), but that's out of scope for this training.

In the next and final exercise we will see that you can skip a few steps and make training a lot easier.