# 3. Deep Q Network

Welcome to the third exercise in this series. In the previous exercises we worked with relatively simple environments that were solvable quickly because the number of states was limited. In this exercise we will create an agent that learns a policy for an environment with lots of states. We will still be estimating the action values, but instead of storing them in a table we will be approximating them with a neural network.

The algorithm is called "Deep Q-learning with Experience Replay" and the paper [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/pdf/1312.5602.pdf) describes it well. The pseudocode of the algorithm is as follows.

<img src="figures/dqn.png" width="75%" />

We will implement it step-by-step.

Let's first include a number of necessities.

In [None]:
import gymnasium
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

We will be displaying some nice screenshots here, but we need to initialize the library for this.

In [None]:
from utils import init_display
init_display()

## Lunar lander

We will be using another Gym environment. This time it is the `LunarLander` environment.

In [None]:
env = gymnasium.make('LunarLander-v2', render_mode='rgb_array')
env._max_episode_steps = 500

print(f'Observation: {env.observation_space}, {env.observation_space.dtype}')
print(f'Action:      {env.action_space}, {env.action_space.dtype}')

As you can see, the observation/state of this environment is a `Box`, which basically means a multi-dimensional array. This time it has one dimension (i.e. it is a vector) of 8 floating point values (instead of a single integer), each with a possible value between -inf and +inf. This means that the number of possible states is infinite, so estimating their values with a table is not feasible. We will have to use function approximation.

The action space is (like before) an integer with 4 discrete values.

**Note**: the change in maximum episode steps is done to improve training speed in this exercise. With the default of 1000 steps the training time would be significantly longer and we don't have time for that. It won't affect the end result too much.

We can render the observation to an image to get a better feel for what the environment looks like.

In [None]:
observation, _ = env.reset()
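# Keep the latest observation so the printed values match the rendered image
for _ in range(40):
    observation, *_ = env.step(0)
for _ in range(4):
    observation, *_ = env.step(2)
print(observation)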
plt.figure(figsize=(8,6))
plt.imshow(env.render())

You can see a moon lander and a landing zone marked with two yellow flags.

The goal of our agent is to land the lander in the target zone. The lander has 3 thrusters: one vertical (pointing downwards) and two horizontal (pointing left and right). The agent has 4 actions: one for firing each thruster and one to do nothing. So there is at most one thruster active at any time.

If it lands too fast, it crashes and you'll get a negative reward. If it lands safely you get a positive reward. For all the fuel you spend you will get a penalty as well. But, we don't have a model, so we pretend we don't know all this ;).

The observation you get is a list of distances and velocities compared to the ground. It is not really important which value is what, because our agent will figure that out by itself.

## Neural network

We will use a neural network as our function approximator. It takes too much time to also introduce the concepts of a neural network, so we will use an existing implementation. It is implemented with [PyTorch](https://pytorch.org/). If you finish early you can take a look at the implementation in the [`estimator.py`](estimator.py) file.

This neural network accepts observations from the environment as input and returns the predicted value of every action in that state, i.e. it approximates the action-value function $q(s, a)$.

Let's create a new instance and see its initial prediction.

In [None]:
from estimator import ActionValueEstimator

estimator = ActionValueEstimator(env)

# Reset the environment to get the initial observation
observation, _ = env.reset()

# Predict the action values for all actions in this state
action_values = estimator.predict(observation)
print(f'action values = {action_values}')

You should see a list of 4 action values with random values. If you run it again the values should be different.

## $\epsilon$-greedy policy

Like in the previous exercise we will need an $\epsilon$-greedy policy to explore during training. This time the policy will be based on the output of the action value estimator model (`estimator.predict`) given an observation. Implement the epsilon greedy policy using [`np.random.uniform`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html), `env.action_space.sample()`, and [`np.argmax`](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html).

In [None]:
def select_action(env, estimator, observation, epsilon):
    ### START CODE ###
    
    # Implement the epsilon-greedy algorithm.
    # With probability `epsilon` take a random action,
    # otherwise take the action with the highest estimated action value.
    ...
        
    ### END CODE ###
    return action
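
The idea behind $\epsilon$-greedy selection can be sketched on a plain array of action values, without the environment or the estimator. The function name `epsilon_greedy_from_values` and the example values are made up for this illustration, and it uses a seeded `np.random.default_rng` generator (instead of `np.random.uniform` and `env.action_space.sample()`) only to make the result reproducible.

```python
import numpy as np

def epsilon_greedy_from_values(action_values, epsilon, rng):
    """Pick an action index from a 1-D array of estimated action values."""
    if rng.uniform() < epsilon:
        # Explore: pick any action uniformly at random.
        return int(rng.integers(len(action_values)))
    # Exploit: pick the action with the highest estimated value.
    return int(np.argmax(action_values))

rng = np.random.default_rng(0)
q = np.array([0.1, -0.3, 0.7, 0.2])  # hypothetical action values

# epsilon = 0 never explores, epsilon = 1 always explores.
greedy = epsilon_greedy_from_values(q, epsilon=0.0, rng=rng)
explored = [epsilon_greedy_from_values(q, epsilon=1.0, rng=rng) for _ in range(200)]
print(greedy, sorted(set(explored)))
```

With `epsilon=0.0` the result is always the index of the largest value, while `epsilon=1.0` spreads the choices over all actions.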

## Target action values

Training of the model will be done on batches of experience (i.e. agent-environment interactions). These batches contain:

- `observation`: The (current) observation, i.e. $S_t$
- `action`: The action taken, i.e. $A_t$
- `reward`: The reward returned by the environment after performing the action, i.e. $R_{t+1}$
- `next_observation`: The next observation after the action is performed, i.e. $S_{t+1}$
- `done`: Whether or not the episode is finished. In other words, whether the `next_observation` belongs to a terminal state.

From such a batch we have to compute the target values for the neural network. This means we have to compute the target for $Q(S_t, A_t)$. Like before, we will be using the Q-learning algorithm, so our target will be bootstrapped with the prediction for the next observation.

$$ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) $$

Implement that in the following function. You have to use `estimator.predict(observations)` to get the action values for the next observations. Then use [`np.amax`](https://numpy.org/doc/stable/reference/generated/numpy.amax.html) to get the value of the action with the highest value. Make sure that you set `axis=1` to get the maximum of each sample and `keepdims=True` to keep the shapes of the arrays in order.

The value of terminal states is always 0 (remember $G_T = 0$), so $Q(S_T, a)$ must be 0 as well. You can use the `dones` array to correctly filter out those values. `dones` is an array of booleans, but similar to the C programming language you can use them as integers. A `True` value is processed as a `1` and a `False` value is processed as `0`. So, if you multiply the maximum action value (`max_next_q`) with `(1-dones)` you will end up with a valid maximum action value in samples that do not end in a terminal state, otherwise the result is 0. 
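
The masking trick can be tried out on a small hand-made batch before using the real estimator. All numbers below are invented, and the arrays are assumed to be column vectors of shape `(N, 1)`; the actual shapes coming out of the experience buffer may differ.

```python
import numpy as np

gamma = 0.99
rewards = np.array([[1.0], [-100.0], [0.5]])
dones = np.array([[False], [True], [False]])   # the second sample ends in a terminal state

# Hypothetical action values for the next observations (3 samples, 4 actions).
next_q = np.array([[ 1.0,  2.0,  0.5,  0.0],
                   [ 3.0,  9.0,  1.0,  2.0],
                   [-1.0, -2.0, -0.5, -3.0]])

# Highest action value per sample, kept as a column of shape (3, 1).
max_next_q = np.amax(next_q, axis=1, keepdims=True)

# (1 - dones) is 1 for ongoing episodes and 0 for terminal ones,
# so the bootstrapped part disappears in the terminal sample.
target_q = rewards + gamma * max_next_q * (1 - dones)
print(target_q)
```

In the second sample the large next value of 9.0 is ignored and the target is just the reward of -100.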

In [None]:
def compute_target_values(estimator, rewards, next_observations, dones, gamma):
    ### START CODE ###
    
    # Estimate the action values for the next observation using the prediction model.
    next_q = ...
    
    # Find the maximum action value.
    max_next_q = ...
    
    # Compute the target action value as described above.
    # Take care of the terminal state.
    target_q = ...
    
    ### END CODE ###
    return target_q

Time to test this. Run the following cell and see if your implementation is correct. A set of pre-defined input files is used to make sure the situation is always the same.

In [None]:
estimator = ActionValueEstimator(env)
estimator.load_weights('data/lunar_test_model.pth')

r = np.load('data/lunar_test_rewards.npy')
o = np.load('data/lunar_test_next_observations.npy')
d = np.load('data/lunar_test_dones.npy')
compute_target_values(estimator, r, o, d, 0.99)

The output of this test should be exactly equal to:

    array([[84.27161251],
           [84.83388991],
           [85.45931969]])

Note that its shape is `(3,1)`, i.e. a single value for each of the three samples.

If your output is not the same, then check your implementation! Make sure the shapes of the intermediate steps are correct. Check this using for instance `print(max_next_q.shape)`.
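
A common cause of a wrong shape is leaving out `keepdims=True`. The small sketch below, with made-up arrays, shows how NumPy broadcasting then silently turns a `(3,1)` result into a `(3,3)` one.

```python
import numpy as np

rewards = np.zeros((3, 1))                          # column vector, one reward per sample
next_q = np.arange(12, dtype=float).reshape(3, 4)   # 3 samples, 4 actions

flat = np.amax(next_q, axis=1)                      # shape (3,)
column = np.amax(next_q, axis=1, keepdims=True)     # shape (3, 1)

print((rewards + flat).shape)     # (3, 3): every reward is combined with every maximum
print((rewards + column).shape)   # (3, 1): one value per sample
```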

## Training step

We now have nearly all the pieces in place to start training. The last missing piece is the training step itself.

The given neural network class has a dedicated function for fitting the neural network to a given dataset. It can be invoked as follows.

    estimator.train(observations, actions, values)

It uses [PyTorch](https://pytorch.org/) to automatically:
- compute the output of the network for the given input data (i.e. predict),
- compute the loss of the output using the given target values,
- compute the gradients of the network weights with respect to the loss,
- update the weights of the network using an optimizer

Each time you call `train`, the estimator changes slightly so that it fits the given data a bit better. Each such pass over the data is called an epoch. In other types of machine learning you usually run multiple epochs to fit your network as well as possible to the given data. For our RL problem we only want to run one epoch and then gather new data with the updated model.
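
To get a feel for what these four steps amount to, here is a hand-written NumPy analogue of a single training step. It is only a sketch under strong assumptions: the "network" is a hypothetical linear map `weights`, the loss is assumed to be a mean squared error, and the update is plain gradient descent. The real `estimator.train` may use a different model, loss and optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, obs_dim, n_actions = 4, 8, 4

# Hypothetical stand-in for the network: observation -> one value per action.
weights = rng.normal(scale=0.1, size=(obs_dim, n_actions))

# Made-up batch of data.
observations = rng.normal(size=(n_samples, obs_dim))
actions = rng.integers(n_actions, size=(n_samples, 1))
targets = rng.normal(size=(n_samples, 1))

def loss_of(w):
    """Mean squared error between the values of the taken actions and the targets."""
    predicted = np.take_along_axis(observations @ w, actions, axis=1)
    return float(np.mean((predicted - targets) ** 2))

loss_before = loss_of(weights)

# 1. Predict the value of the action that was taken in each sample.
predicted = np.take_along_axis(observations @ weights, actions, axis=1)
# 2. The loss on these predictions is what loss_of computes.
# 3. Gradient of the loss with respect to the weights;
#    only the column of the taken action receives an error signal.
error = np.zeros((n_samples, n_actions))
np.put_along_axis(error, actions, 2 * (predicted - targets) / n_samples, axis=1)
gradient = observations.T @ error
# 4. Update the weights with one small gradient-descent step.
learning_rate = 0.01
weights = weights - learning_rate * gradient

loss_after = loss_of(weights)
print(f'loss before: {loss_before:.4f}, after: {loss_after:.4f}')
```

Only the values of the actions that were actually taken contribute to the loss, which is why `train` needs the `actions` as well as the `observations`.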

Implement the `train` function below. It performs two steps. First we need to call the earlier implemented `compute_target_values()` function to get the target values for our estimator, and then call the `train` function of that estimator.

In [None]:
def train(estimator, observations, actions, rewards, next_observations, dones, gamma):
    ### START CODE ###

    # Compute target values
    target_q = ...

    # Train the estimator to this data.
    # It needs both the `observations` and `actions` as input parameters to the neural network
    # and it needs `target_q` as target for the output of the neural network
    ...
    
    ### END CODE ###

## Training

Finally, we can implement the training loop. Let's define a number of hyperparameters for our model, that we will explain later.

In [None]:
epsilon = 0.3
gamma = 0.99
batch_size = 256
max_time = 300 # seconds
max_episode_length = 1000

Let's recreate the estimator so we can start fresh.

In [None]:
estimator = ActionValueEstimator(env)

### Experience replay

As explained in the presentation, we will be using experience replay. In other words, the data gathered while interacting with the environment is stored in a big buffer with maximum capacity. From this buffer we take random batches to train on.

We will use the class `ExperienceBuffer` for this, which is implemented in the [`utils.py`](utils.py) file. If you want, you can take a look at how it works. It is basically a big queue of data with a maximum capacity. When new data is added to a full buffer, the oldest data is removed/overwritten.

It has two important functions that you need to use:
- `experience_buffer.append(observation, action, reward, next_observation, done)` to append the data of a single step.
- `experience_buffer.sample(batch_size)` to get a random batch of size `batch_size`.

Let's create an instance that can store 10000 steps.

In [None]:
from utils import ExperienceBuffer
experience_buffer = ExperienceBuffer(10000, env.observation_space.shape, (1,))
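
The queue behaviour described above can be illustrated with a few lines of standard-library Python. `TinyReplayBuffer` is a hypothetical toy class written for this illustration; it is not the `ExperienceBuffer` from `utils.py`, which may store and return its data differently (for instance as NumPy arrays).

```python
import random
from collections import deque

class TinyReplayBuffer:
    """Hypothetical minimal replay buffer, for illustration only."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item when a new one is added to a full buffer.
        self.steps = deque(maxlen=capacity)

    def append(self, observation, action, reward, next_observation, done):
        self.steps.append((observation, action, reward, next_observation, done))

    def sample(self, batch_size):
        # Uniformly sample distinct stored steps.
        return random.sample(list(self.steps), batch_size)

buffer = TinyReplayBuffer(capacity=3)
for step in range(5):
    # Use the step number as a fake observation so we can see what is kept.
    buffer.append(step, 0, 1.0, step + 1, False)

stored = [s[0] for s in buffer.steps]
batch = buffer.sample(2)
print(stored)   # the two oldest steps (0 and 1) have been dropped
```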

### Train loop

The main training loop basically consists of three steps.
1. Perform an action by using the $\epsilon$-greedy policy. (Use `select_action` and `env.step`)
2. Store experience in the buffer. (Use `experience_buffer.append`)
3. Take a random batch from the experience buffer and train on it. (Use `experience_buffer.sample` and `train`)

When an episode is finished we start a new episode and continue with these steps.

There is only one problem in the LunarLander environment that is annoying during training. The environment does not stop automatically when the agent takes too long to finish the task. An episode can in some situations take >10000 steps, which means we have to wait quite a while before the episode is finished. To speed it all up we have to add an 'early stop' to the loop. If the episode takes more than `max_episode_length` steps it should abort.
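
The structure of one episode, including the early stop, can be sketched with stand-ins. Everything here is hypothetical: `StubEnv` is a fake environment that never terminates by itself, actions are chosen at random instead of with `select_action`, a plain `deque` replaces the experience buffer, and the training step is left out.

```python
import random
from collections import deque

class StubEnv:
    """Hypothetical environment stub that mimics the five return values of step()."""

    def reset(self):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        self.t += 1
        # observation, reward, done flag, and two values the loop ignores
        return float(self.t), -1.0, False, False, {}

stub_env = StubEnv()
buffer = deque(maxlen=1000)
max_episode_length = 50   # illustrative value

observation, _ = stub_env.reset()
done, episode_length = False, 0
while not done:
    # 1. Perform an action (random here, for brevity).
    action = random.randrange(4)
    next_observation, reward, done, _, _ = stub_env.step(action)

    # Early stop: abort an episode that takes too long.
    if episode_length >= max_episode_length - 1:
        done = True

    # 2. Store the experience.
    buffer.append((observation, action, reward, next_observation, done))

    # 3. A real loop would sample a batch from the buffer and train on it here.

    observation = next_observation
    episode_length += 1

print(episode_length, len(buffer))
```

Because the stub never reports a finished episode, the loop ends only through the early stop, after exactly `max_episode_length` steps.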

Fill in the required steps below and run it. It will show you the progress it makes per episode. You should see the scores go up on average.

**Running this training loop takes 5 minutes.** Take a coffee/tea break and come back later to see the results.

In [None]:
start_time = datetime.now()

episode = 0
while (datetime.now() - start_time).total_seconds() < max_time:
    episode_length, episode_score = 0, 0

    observation, _ = env.reset()
    done = False
    while not done:
        ### START CODE ###
        
        # Select an action using the epsilon-greedy policy
        action = ...
        
        # Take action and get reward
        next_observation, reward, done, _, _ = ...
        
        # End early
        if episode_length >= max_episode_length - 1:
            done = True

        # Add to experience buffer
        ...
        
        ### END CODE ###

        if experience_buffer.is_filled(batch_size):
            ### START CODE ###
            
            # Select a random batch from experience
            observations, actions, rewards, next_observations, dones = ...

            # Perform one train step on the batch
            ...
            
            ### END CODE ###

        # Prepare for next iteration
        observation = next_observation
        episode_length += 1
        episode_score += reward
    
    # Show statistics
    t = (datetime.now() - start_time).total_seconds()
    print(f'[{t:.1f}] #{episode}: length {episode_length}, score {episode_score:.0f}')

    episode += 1

# Save the end result
estimator.save_weights('lunar-dqn.pth')

Phew, it finally finished training. By now the model should have been trained for 5 minutes or about 150+ episodes and the score should on average have been increasing. Time to evaluate.

**Note:**
If it didn't increase at all, then you are one of the unlucky few whose model converged on a local (sub-)optimum. This sometimes happens in deep reinforcement learning, and in particular in unstable environments like this one. There are ways to prevent this, like choosing less aggressive learning rates and bigger batch sizes, but those will increase training time a lot. For now, the only thing you can do is to restart the entire notebook. If you have the time (maybe while listening to the next part of the lectures) you can click on `Kernel`->`Restart & Run All` and try again. But first read on to the end.

## Evaluation

It took quite some time to train and that is only for a relatively small number of episodes. Let's see how well it performs anyway. We will use the following helper function to evaluate a model by taking greedy actions only (no more exploring, only exploiting) and reporting the final score. In the meantime we'll render the environment to show you what happens.

Implement the missing parts. You can take a look at the "greedy" part of the `select_action` function for some additional information.

In [None]:
from utils import create_frame, update_frame

def evaluate(estimator, env):
    ### START CODE ###

    # Reset environment
    observation, _ = ...
    
    ### END CODE ###

    done, score, length = False, 0, 0
    frame = create_frame(env)

    while not done:
        ### START CODE ###
        
        # Greedily select an action based on the estimated action values
        action = ...
        
        # Perform action
        observation, reward, done, _, _ = ...
        
        ### END CODE ###
        
        # End early
        if length >= max_episode_length - 1:
            done = True

        score += reward
        length += 1
        update_frame(frame)

    return score

Let's first see how the randomly initialized model performs. It should be really bad.

In [None]:
estimator = ActionValueEstimator(env)
score = evaluate(estimator, env)
print(f'initial model: {score:.1f}')

Now let's see how well your trained model performs. We'll load the model as saved at the end of your training loop.

In [None]:
estimator.load_weights('lunar-dqn.pth')
score = evaluate(estimator, env)
print(f'partially trained model: {score:.1f}')

It should perform a lot better. It should no longer instantly crash and burn. But it performs nowhere near optimal. For that we need to train it a lot longer. Here is the performance of a model that was trained for 4500 episodes with a decreasing value of epsilon (epsilon decay) to assist in training the later stages. This took more than an hour.

In [None]:
estimator.load_weights('trained/lunar-dqn.pth')
score = evaluate(estimator, env)
print(f'fully trained model: {score:.1f}')
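
The epsilon decay mentioned above can take many forms. The notebook does not say which schedule was used for the pre-trained model, so the exponential schedule below, its function name and all of its parameter values are assumptions chosen purely for illustration.

```python
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Shrink epsilon exponentially per episode, but never below `end`."""
    return max(end, start * decay ** episode)

# Explore a lot in early episodes, mostly exploit in later ones.
schedule = [decayed_epsilon(e) for e in (0, 100, 1000, 4500)]
print(schedule)
```

In a training loop, such a function would replace the fixed `epsilon` when selecting an action.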

## Conclusion

We have trained a model-free agent. Purely by interacting with the environment it learned what actions are best for each observation it receives. It takes quite some time to train, and it didn't even get that far. It requires a lot more time and experience to train to a decent score. But the size of the observation is also a lot bigger than before. There are a number of things you can do to improve training time and sample efficiency (how much it learns from a single interaction with the environment), but that's out of scope for this training.

In the next and final exercise we will implement another way to train an agent using a neural network that allows more complex environments.

If you are interested you can also take a look at the implementation of the neural network in the [`estimator.py`](estimator.py) file.