# Improving the Double DQN algorithm using prioritized experience replay 
> Notes on improving the Double DQN algorithm using prioritized experience replay.

- branch: 2020-04-14-prioritized-experience-replay
- badges: true
- image: images/prioritized-experience-replay.png
- comments: true
- author: David R. Pugh
- categories: [pytorch, deep-reinforcement-learning, deep-q-networks]

## Motivation

I am currently using my COVID-19 imposed quarantine to expand my deep learning skills by completing the [*Deep Reinforcement Learning Nanodegree*](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) from [Udacity](https://www.udacity.com/). This past week I have been working my way through the seminal 2015 *Nature* paper from researchers at [DeepMind](https://deepmind.com/), [*Human-level control through deep reinforcement learning*](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) (Mnih et al 2015).

### Why is Mnih et al 2015 important?

While Mnih et al 2015 was not the first paper to use neural networks to approximate the action-value function, it was the first to demonstrate that the same neural network architecture could be trained in a computationally efficient manner to "solve" a large number of different tasks.

The paper also contributed several practical "tricks" for getting deep neural networks to consistently converge during training. This was a non-trivial contribution, as issues with training convergence had plagued previous attempts to use neural networks as function approximators in reinforcement learning tasks and were blocking widespread adoption of deep learning techniques within the reinforcement learning community.

## Summary of the paper

Mnih et al 2015 use a deep (convolutional) neural network to approximate the optimal action-value function

$$ Q^*(s, a) = \max_{\pi} \mathbb{E}\Bigg[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \Big| s_t=s, a_t=a, \pi \Bigg] $$

which is the maximum sum of rewards $r_t$ discounted by $\gamma$ at each time-step $t$ achievable by a behaviour policy $\pi = P(a|s)$, after making an observation of the state $s$ and taking an action $a$. 
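
To make the discounted sum inside the expectation concrete, here is a minimal sketch that computes the discounted return for one finite sequence of rewards. The function name `discounted_return` and the example rewards are hypothetical; $Q^*$ is the maximum over policies of the expected value of this quantity.

```python
import typing


def discounted_return(rewards: typing.Sequence[float], gamma: float) -> float:
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # iterate backwards so that each earlier step applies one more factor of gamma
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# three rewards of 1.0 with gamma = 0.5: 1.0 + 0.5 + 0.25
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example_return)  # 1.75
```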

Prior to this seminal paper it was well known that standard reinforcement learning algorithms were unstable or even diverged when non-linear function approximators such as neural networks were used to represent the action-value function $Q$. Why? 

Mnih et al 2015 discuss several reasons.

1. Correlations present in the sequence of observations of the state $s$. In reinforcement learning applications the sequence of state observations is a time series which will almost surely be auto-correlated. But surely this would also be true of any application of deep neural networks to model time series data. 
2. Small updates to $Q$ may significantly change the policy, $\pi$, and therefore change the data distribution.
3. Correlations between the action-values, $Q$, and the target values $r + \gamma \max_{a'} Q(s', a')$.

In the paper the authors address these issues by using...

* a biologically inspired mechanism they refer to as *experience replay* that randomizes over the data which removes correlations in the sequence of observations of the state $s$ and smoothes over changes in the data distribution (issues 1 and 2 above).
* an iterative update rule that adjusts the action-values, $Q$, towards target values, $Q'$ that are only periodically updated thereby reducing correlations with the target (issue 3 above).
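
The experience replay idea can be sketched with nothing more than a bounded queue and uniform random sampling. This is a minimal illustration under my own assumptions, not the implementation from the paper: the class name `UniformReplayBuffer` and the toy transitions are hypothetical.

```python
import collections
import random
import typing

Transition = typing.Tuple[typing.Any, int, float, typing.Any, bool]


class UniformReplayBuffer:
    """Fixed-size buffer that replays stored transitions uniformly at random."""

    def __init__(self, maximum_size: int, seed: typing.Optional[int] = None) -> None:
        # a bounded deque silently drops the oldest transition once it is full
        self._buffer: typing.Deque[Transition] = collections.deque(maxlen=maximum_size)
        self._random = random.Random(seed)

    def __len__(self) -> int:
        return len(self._buffer)

    def append(self, transition: Transition) -> None:
        self._buffer.append(transition)

    def sample(self, batch_size: int) -> typing.List[Transition]:
        # random sampling breaks up the temporal ordering of the stored transitions
        return self._random.sample(list(self._buffer), batch_size)


buffer = UniformReplayBuffer(maximum_size=3, seed=42)
for t in range(5):
    # hypothetical transitions: (state, action, reward, next_state, done)
    buffer.append((t, 0, 1.0, t + 1, False))

batch = buffer.sample(batch_size=2)
```

After five appends only the three most recent transitions remain, and each sampled batch mixes transitions from different time-steps.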

### Approximating the action-value function, $Q(s,a)$

There are several possible ways of approximating the action-value function $Q$ using a neural network. The only input to the DQN architecture is the state representation and the output layer has a separate output for each possible action. The output units correspond to the predicted $Q$-values of the individual actions for the input state. A representation of the DQN architecture from the paper is reproduced in the figure below. 

<center>
    <img src=images/q-network-architecture.jpg width=1000>
</center>

The input to the neural network consists of an 84 x 84 x 4 image produced by the preprocessing map $\phi$. The network has four hidden layers:

* Convolutional layer with 32 filters (each of which uses an 8 x 8 kernel and a stride of 4) and a ReLU activation function. 
* Convolutional layer with 64 filters (each of which uses a 4 x 4 kernel and a stride of 2) and a ReLU activation function. 
* Convolutional layer with 64 filters (each of which uses a 3 x 3 kernel and a stride of 1) and a ReLU activation function. 
* Fully-connected (i.e., dense) layer with 512 neurons followed by a ReLU activation function.
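
As a quick check on the layer sizes listed above, the sketch below works out how many features reach the fully-connected layer. It assumes convolutions without padding or dilation (the default for PyTorch's `nn.Conv2d`); the helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int) -> int:
    """Spatial output size of a convolution without padding or dilation."""
    return (input_size - kernel_size) // stride + 1


size = 84
# (kernel size, stride) of the three convolutional layers
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel_size, stride)

# 64 filters in the last convolutional layer, each producing a size x size map
flattened_features = 64 * size * size
print(size, flattened_features)  # 7 3136
```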

The output layer is another fully-connected layer with a single output for each action. A PyTorch implementation of the DQN architecture would look something like the following.

In [None]:
import typing

import torch
from torch import nn


DeepQNetwork = nn.Module
DeepQNetworkFn = typing.Callable[[], DeepQNetwork]


class LambdaLayer(nn.Module):
    
    def __init__(self, f):
        super().__init__()
        self._f = f
        
    def forward(self, X):
        return self._f(X)


def make_deep_q_network_fn(action_size: int) -> DeepQNetworkFn:
    
    def deep_q_network_fn() -> DeepQNetwork:
        q_network = nn.Sequential(
            nn.Conv2d(in_channels=4, out_channels=32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
            nn.ReLU(),
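            # 3 x 3 kernel, matching the third convolutional layer described above
            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
            nn.ReLU(),
            LambdaLayer(lambda tensor: tensor.view(tensor.size(0), -1)),
            # an 84 x 84 input shrinks to 20 x 20, 9 x 9 and then 7 x 7, so 64 * 7 * 7 = 3136
            nn.Linear(in_features=3136, out_features=512),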
            nn.ReLU(),
            nn.Linear(in_features=512, out_features=action_size)
        )
        return q_network
    
    return deep_q_network_fn

Since, for this project at least, I am not learning directly from pixels/images but am rather 
working with a preprocessed representation of the environment state, I decided to implement a 
three-layer dense neural network. I use an outer function to close over the parameters defining the 
network architecture: the number of states and the number of actions are determined by the 
environment, so the only hyperparameter is the number of hidden units.

In [None]:
def make_deep_q_network_fn(number_states: int,
                           number_actions: int,
                           number_hidden_units: int) -> DeepQNetworkFn:
    """Create a function that returns a DeepQNetwork with appropriate input and output shapes."""
    
    def deep_q_network_fn() -> DeepQNetwork:
        deep_q_network = nn.Sequential(
            nn.Linear(in_features=number_states, out_features=number_hidden_units),
            nn.ReLU(),
            nn.Linear(in_features=number_hidden_units, out_features=number_hidden_units),
            nn.ReLU(),
            nn.Linear(in_features=number_hidden_units, out_features=number_actions)
        )
        return deep_q_network
    
    return deep_q_network_fn


## Double Deep Q-Network (DDQN) Algorithm

In this section I discuss and implement the Double DQN algorithm from 
[*Deep Reinforcement Learning with Double Q-Learning*](https://arxiv.org/abs/1509.06461) 
(Van Hasselt et al 2015). The DDQN algorithm is a minor, but important, modification of the original 
DQN algorithm discussed above.

The Van Hasselt et al 2015 paper makes several important contributions. 

1. Demonstration of how Q-learning can be overly optimistic in large-scale, even 
   deterministic, problems due to the inherent estimation errors of learning. 
2. Demonstration that overestimations are more common and severe in practice than previously 
   acknowledged. 
3. Implementation of Double Q-learning called Double DQN that extends, with minor 
   modifications, the popular DQN algorithm and that can be used at scale to successfully 
   reduce overestimations with the result being more stable and reliable learning.
4. Demonstration that Double DQN finds better policies by obtaining new state-of-the-art 
   results on the Atari 2600 domain.
   
### Q-learning overestimates Q-values

No matter what type of function approximation scheme is used to approximate the action-value 
function $Q$ there will always be approximation error. The presence of the max operator in the 
[Bellman equation](https://en.wikipedia.org/wiki/Bellman_equation) used to compute the $Q$-values 
means that these errors do not average out: taking the maximum over noisy estimates favors 
whichever estimates happen to be too high, so the approximate $Q$-values tend to overestimate the 
corresponding $Q$-values of the true action-value function (i.e., the approximation errors are 
biased upward). This potentially significant source of bias can impede learning and is often 
exacerbated by the use of flexible, non-linear function approximators such as neural networks. 

Double Q-learning addresses these issues by explicitly separating action selection from action 
evaluation which allows each step to use a different function approximator resulting in a better 
overall approximation of the action-value function. Figure 2 (with caption) below, which is taken 
from Van Hasselt et al 2015, summarizes these ideas. See the 
[paper](https://arxiv.org/pdf/1509.06461.pdf) for more details.

<center>
    <img src=images/double-dqn-figure-2.png width=1000>
</center>
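
The overestimation effect, and the way a second independent estimator removes it, can be reproduced with a small simulation. This is a toy sketch under my own assumptions and not an experiment from the paper: every true Q-value is 0, the estimates are corrupted by independent Gaussian noise, and the function names are hypothetical.

```python
import random
import statistics


def max_of_noisy_estimates(number_actions: int, noise_scale: float, rng: random.Random) -> float:
    """Max over actions of noisy estimates when every true Q-value is 0."""
    return max(rng.gauss(0.0, noise_scale) for _ in range(number_actions))


def double_estimate(number_actions: int, noise_scale: float, rng: random.Random) -> float:
    """Select the action with one set of noisy estimates, evaluate it with an independent set."""
    selector = [rng.gauss(0.0, noise_scale) for _ in range(number_actions)]
    evaluator = [rng.gauss(0.0, noise_scale) for _ in range(number_actions)]
    greedy_action = max(range(number_actions), key=selector.__getitem__)
    return evaluator[greedy_action]


rng = random.Random(0)
number_trials = 10_000
single = statistics.mean(max_of_noisy_estimates(4, 1.0, rng) for _ in range(number_trials))
double = statistics.mean(double_estimate(4, 1.0, rng) for _ in range(number_trials))

# the true maximum is 0.0: the single estimator is biased upward, the double estimator is not
print(f"single estimator: {single:.2f}, double estimator: {double:.2f}")
```

The single estimator averages well above zero even though no action is actually better than any other, while the average of the double estimator stays close to zero.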
    
### Implementing the Double DQN update

The key idea behind Double Q-learning is to reduce overestimations of Q-values by separating the selection of actions from the evaluation of those actions so that a different Q-network can be used in each step. When applying Double Q-learning to extend the DQN algorithm one can use the online Q-network, $Q(S, a; \theta)$, to select the actions and then the target Q-network, $Q(S, a; \theta^{-})$, to evaluate the selected actions.

Before implementing the Double DQN algorithm, I am going to re-implement the Q-learning update from the DQN algorithm in a way that explicitly separates action selection from action evaluation. Once I have implemented this new version of Q-learning, implementing the Double DQN algorithm will be much easier. Formally separating action selection from action evaluation involves re-writing the Q-learning Bellman equation as follows.

$$ Y_t^{DQN} = R_{t+1} + \gamma Q\big(S_{t+1}, \underset{a}{\mathrm{argmax}}\ Q(S_{t+1}, a; \theta_t); \theta_t\big) $$

In Python this can be implemented as three separate functions.

In [None]:
QValues = torch.Tensor


def select_greedy_actions(states: torch.Tensor, q_network: DeepQNetwork) -> torch.Tensor:
    """Select the greedy action for the current state given some Q-network."""
    _, actions = q_network(states).max(dim=1, keepdim=True)
    return actions


def evaluate_selected_actions(states: torch.Tensor,
                              actions: torch.Tensor,
                              rewards: torch.Tensor,
                              dones: torch.Tensor,
                              gamma: float,
                              q_network: DeepQNetwork) -> QValues:
    """Compute the Q-values by evaluating the actions given the current states and Q-network."""
    next_q_values = q_network(states).gather(dim=1, index=actions)        
    q_values = rewards + (gamma * next_q_values * (1 - dones))
    return q_values


def q_learning_update(states: torch.Tensor,
                      rewards: torch.Tensor,
                      dones: torch.Tensor,
                      gamma: float,
                      q_network: DeepQNetwork) -> QValues:
    """Q-Learning uses a single Q-network to select and evaluate actions."""
    actions = select_greedy_actions(states, q_network)
    q_values = evaluate_selected_actions(states, actions, rewards, dones, gamma, q_network)
    return q_values


From here it is straightforward to implement the Double DQN algorithm. All I need is a second action-value function. The target network in the DQN architecture provides a natural candidate for the second action-value function. Van Hasselt et al 2015 suggest using the online Q-network to select the greedy policy actions before using the target Q-network to estimate the value of the selected actions. Once again here are the maths...

$$ Y_t^{DoubleDQN} = R_{t+1} + \gamma Q\big(S_{t+1}, \underset{a}{\mathrm{argmax}}\ Q(S_{t+1}, a; \theta_t); \theta_t^{-}\big) $$

...and here is the Python implementation.

In [None]:
def double_q_learning_update(states: torch.Tensor,
                             rewards: torch.Tensor,
                             dones: torch.Tensor,
                             gamma: float,
                             q1: DeepQNetwork,
                             q2: DeepQNetwork) -> QValues:
    """Double Q-Learning uses q1 to select actions and q2 to evaluate the selected actions."""
    actions = select_greedy_actions(states, q1)
    q_values = evaluate_selected_actions(states, actions, rewards, dones, gamma, q2)
    return q_values

Note that the function `double_q_learning_update` is almost identical to the `q_learning_update` function above: all that is needed is to introduce a second Q-network parameter, `q2`, to the function. This second Q-network is used to evaluate the actions chosen using the original Q-network parameter, now called `q1`.
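
A small worked example with plain Python lists may help to see how the two targets differ for a single transition. The Q-values below are made up for illustration and the function names are hypothetical; `next_q1_values` plays the role of the online network and `next_q2_values` the role of the target network.

```python
import typing


def greedy_action(q_values: typing.Sequence[float]) -> int:
    """Index of the largest Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def q_learning_target(reward: float,
                      done: bool,
                      gamma: float,
                      next_q_values: typing.Sequence[float]) -> float:
    """Select and evaluate the next action with the same Q-values."""
    action = greedy_action(next_q_values)
    return reward + gamma * next_q_values[action] * (1 - done)


def double_q_learning_target(reward: float,
                             done: bool,
                             gamma: float,
                             next_q1_values: typing.Sequence[float],
                             next_q2_values: typing.Sequence[float]) -> float:
    """Select the next action with q1 and evaluate it with q2."""
    action = greedy_action(next_q1_values)
    return reward + gamma * next_q2_values[action] * (1 - done)


# made-up Q-values for a single next state with three actions
next_q1_values = [1.0, 3.0, 2.0]  # online network
next_q2_values = [2.5, 1.5, 2.0]  # target network

# single network: 1.0 + 0.9 * 2.5
single = q_learning_target(1.0, False, 0.9, next_q2_values)
# double: q1 picks action 1, q2 values it at 1.5, so 1.0 + 0.9 * 1.5
double = double_q_learning_target(1.0, False, 0.9, next_q1_values, next_q2_values)
# terminal transitions drop the bootstrap term and keep only the reward
terminal = double_q_learning_target(1.0, True, 0.9, next_q1_values, next_q2_values)
```

Because the two networks disagree about which action is best, the double target is lower than the single-network target, which is exactly the mechanism that dampens overestimation.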

## Prioritized Experience Replay

In this section I will discuss implementation of an important enhancement of the experience replay 
idea from [*Prioritized Experience Replay*](https://arxiv.org/abs/1511.05952) (Schaul et al 2016).

The following quote from the paper nicely summarizes the key idea.

> Experience replay liberates online learning agents from processing transitions in the exact order
they are experienced. Prioritized replay further liberates agents from considering transitions with
the same frequency that they are experienced.
...
In particular, we propose to more frequently replay transitions with high expected learning progress,
as measured by the magnitude of their temporal-difference (TD) error. This prioritization can lead
to a loss of diversity, which we alleviate with stochastic prioritization, and introduce bias, which
we correct with importance sampling.

Without further ado let's dive into discussing how to implement prioritized experience replay. 
Using an experience replay buffer naturally leads to two issues that need to be addressed. 

1. Which experiences should the agent store in the replay buffer?
2. Which experiences should the agent replay from the buffer in order to learn efficiently?

Schaul et al 2016 take the contents of the replay buffer more or less as given and focus solely on 
answering the second question. That is, the paper focuses on developing a procedure for making the 
most effective use of the experience replay buffer for learning.

Before discussing the procedure to sample from prioritized experiences, I need to discuss what 
information a reinforcement learning (RL) agent has available to prioritize its experiences for 
replay.

### Prioritization using the temporal-difference (TD) error term

You can't prioritize experiences for learning unless you can measure the importance of each 
experience in the learning process. The ideal criterion would be the amount that the RL agent can 
learn from an experience given its current state (i.e., the expected learning value of the 
experience). 

Unfortunately such an ideal criterion is not directly measurable. However, a reasonable proxy is 
the magnitude of an experience’s temporal-difference (TD) error $\delta_i$. The TD-error 
indicates how "surprising" or "unexpected" the experience is given the current state of the RL 
agent. Using the TD-error term to prioritize experiences for replay is particularly suitable for 
incremental, online RL algorithms, such as 
[SARSA](https://en.wikipedia.org/wiki/State%E2%80%93action%E2%80%93reward%E2%80%93state%E2%80%93action) 
or [Q-learning](https://en.wikipedia.org/wiki/Q-learning), 
as these algorithms already compute the TD-error and update the parameters proportionally.

Using the notation developed above the TD-error term can be written as follows.

$$ \delta_{i,t} = R_{i,t+1} + \gamma Q\big(S_{i,t+1}, \underset{a}{\mathrm{argmax}}\ Q(S_{i,t+1}, a; \theta_t); \theta^{-}_t\big) - Q\big(S_{i,t}, a_{i,t}; \theta_t\big) $$

In the cell below I define a function for computing the TD-error.

In [122]:
TDErrors = torch.Tensor


def double_q_learning_error(states: torch.Tensor,
                            actions: torch.Tensor,
                            rewards: torch.Tensor,
                            next_states: torch.Tensor,
                            dones: torch.Tensor,
                            gamma: float,
                            q1: DeepQNetwork,
                            q2: DeepQNetwork) -> TDErrors:
    expected_q_values = double_q_learning_update(next_states, rewards, dones, gamma, q1, q2)
    q_values = q1(states).gather(dim=1, index=actions)
    delta = expected_q_values - q_values
    return delta

Now that I have defined a measurable criterion by which an RL agent can prioritize its 
experiences, I can move on to discussing the major contribution of the Schaul et al 2016 paper 
which was an efficient procedure for randomly sampling and replaying prioritized experiences.
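
Before sampling, the TD-errors have to be turned into strictly positive priorities. Schaul et al 2016 describe a proportional variant, $p_i = |\delta_i| + \epsilon$, and a rank-based variant, $p_i = 1 / \mathrm{rank}(i)$. The sketch below illustrates both in plain Python; the function names and the value of `epsilon` are my own assumptions.

```python
import typing


def proportional_priorities(td_errors: typing.Sequence[float],
                            epsilon: float = 1e-6) -> typing.List[float]:
    """Priority is the TD-error magnitude plus a small constant so it is never zero."""
    return [abs(delta) + epsilon for delta in td_errors]


def rank_based_priorities(td_errors: typing.Sequence[float]) -> typing.List[float]:
    """Priority is 1 / rank, where rank 1 is the largest TD-error magnitude."""
    order = sorted(range(len(td_errors)), key=lambda i: abs(td_errors[i]), reverse=True)
    priorities = [0.0] * len(td_errors)
    for rank, index in enumerate(order, start=1):
        priorities[index] = 1.0 / rank
    return priorities


td_errors = [0.5, -2.0, 0.0]  # made-up TD-errors for three experiences
proportional = proportional_priorities(td_errors, epsilon=0.01)
rank_based = rank_based_priorities(td_errors)
```

In both variants the experience with a TD-error of exactly zero still receives a positive priority, so it can still be replayed.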

### Stochastic prioritization

Schaul et al 2016 introduce a stochastic sampling method that interpolates between pure greedy 
experience prioritization (i.e., always sampling the highest priority experiences) and uniform 
random sampling of experience. The probability of sampling experience $i$ is defined as follows.

$$ P(i) = \frac{p_i^{\alpha}}{\sum_{j=1}^{N} p_j^{\alpha}} $$

where $p_i > 0$ is the priority of transition $i$ and $N$ is the number of transitions stored in the buffer. The exponent $\alpha$ determines how much 
prioritization is used, with $\alpha = 0$ corresponding to the uniform random sampling case. Note 
that the probability of being sampled is monotonic in an experience’s priority while guaranteeing 
a non-zero probability for the lowest-priority experience.
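
The effect of $\alpha$ is easiest to see on a tiny example. The following plain-Python sketch evaluates the formula above for three made-up priorities; the function name `prioritized_probabilities` is hypothetical.

```python
import typing


def prioritized_probabilities(priorities: typing.Sequence[float],
                              alpha: float) -> typing.List[float]:
    """P(i) = p_i**alpha / sum_j p_j**alpha."""
    scaled = [priority**alpha for priority in priorities]
    total = sum(scaled)
    return [value / total for value in scaled]


priorities = [1.0, 2.0, 4.0]
uniform = prioritized_probabilities(priorities, alpha=0.0)       # 1/3, 1/3, 1/3
proportional = prioritized_probabilities(priorities, alpha=1.0)  # 1/7, 2/7, 4/7
sharper = prioritized_probabilities(priorities, alpha=2.0)       # 1/21, 4/21, 16/21
```

Larger values of $\alpha$ concentrate more of the probability mass on the highest-priority experience, but the lowest-priority experience always keeps a non-zero chance of being sampled.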

### Correcting for sampling bias

Estimation of the expected value from stochastic updates relies on those updates being drawn from 
the same underlying distribution whose expectation you wish to estimate. Prioritized experience 
replay introduces a form of sampling bias that changes the underlying distribution (whose 
expectation needs to be estimated) in an uncontrolled fashion. When the underlying distribution 
changes, the solution to which the algorithm will converge also changes (even if the policy and 
state distribution are fixed). In order for the algorithm to converge properly, the bias 
introduced by the prioritized experience replay procedure needs to be corrected.

Schaul et al 2016 correct for this bias using an importance sampling scheme that computes a weight 
for each sampled experience that can be used when computing the loss for that sample.

$$ w_i = \left(\frac{1}{N}\frac{1}{P(i)}\right)^\beta $$

The hyperparameter $\beta \ge 0$ controls how strongly to correct for the bias: $\beta=0$ implies 
no correction; $\beta=1$ fully compensates for the bias. For stability reasons, since these 
importance sampling weights are included in the loss, they are normalized by $\max_i\ w_i$.
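
Here is a small worked example of the weight formula in plain Python. The sampling probabilities are made up and the function name `importance_sampling_weights` is hypothetical.

```python
import typing


def importance_sampling_weights(probabilities: typing.Sequence[float],
                                beta: float,
                                normalize: bool = True) -> typing.List[float]:
    """w_i = (1 / (N * P(i)))**beta, optionally divided by the largest weight."""
    n = len(probabilities)
    weights = [(1.0 / (n * probability))**beta for probability in probabilities]
    if normalize:
        largest = max(weights)
        weights = [weight / largest for weight in weights]
    return weights


probabilities = [0.5, 0.25, 0.25]  # the first experience is over-sampled
no_correction = importance_sampling_weights(probabilities, beta=0.0)    # all weights equal 1
full_correction = importance_sampling_weights(probabilities, beta=1.0)  # roughly 0.5, 1, 1
```

With full correction the over-sampled experience is down-weighted: it is drawn twice as often as the other two, so each of its updates counts half as much.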

### Implementation

The most important implementation detail is that instead of using a 
[fixed-length, double-ended queue](https://docs.python.org/3.8/library/collections.html#collections.deque) 
as the underlying data structure for storing experiences, I am using a NumPy 
[structured array](https://docs.scipy.org/doc/numpy/user/basics.rec.html) to store 
priority-experience tuples which substantially improves the algorithm performance.
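
For readers who have not used structured arrays before, the following minimal sketch shows the operations that matter for a replay buffer: writing (priority, experience) records, vectorized reductions over the priorities, and selecting several experiences at once. The field names mirror the description above, the experiences are made up, and I use a generic `object` field to hold them.

```python
import numpy as np

# each record pairs a float priority with an arbitrary Python object (the experience)
dtype = [("priority", np.float32), ("experience", object)]
buffer = np.empty(3, dtype=dtype)

# hypothetical experiences: (state, action, reward, next_state, done)
buffer[0] = (1.0, ("s0", 0, 0.0, "s1", False))
buffer[1] = (0.5, ("s1", 1, 1.0, "s2", False))
buffer[2] = (2.0, ("s2", 0, 0.0, "s3", True))

# accessing a field gives an array over one column, so reductions are vectorized
lowest = buffer["priority"].argmin()  # index of the lowest-priority record
highest = buffer["priority"].max()    # largest priority currently stored

# indexing with a list of positions selects several experiences at once
sampled = buffer["experience"][[0, 2]]
```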


In [118]:
import collections
import typing

import numpy as np


BetaAnnealingSchedule = typing.Optional[typing.Callable[[int, float], float]]
RandomState = typing.Optional[np.random.RandomState]


def sampling_probabilities(priorities: np.ndarray, alpha: float):
    """Sampling probability is increasing function of priority"""
    return priorities**alpha / np.sum(priorities**alpha)
    

def sampling_weights(probabilities: np.ndarray, beta: float, normalize: bool):
    """Importance sampling weights correct for sampling bias introduced by prioritization."""
    n = probabilities.size
    weights = (n * probabilities)**-beta
    if normalize:
        weights = weights / weights.max()
    return weights


_field_names = [
    "state",
    "action",
    "reward",
    "next_state",
    "done"
]
Experience = collections.namedtuple("Experience", field_names=_field_names)


class PrioritizedExperienceReplayBuffer:
    """Fixed-size buffer to store priority, Experience tuples."""

    def __init__(self,
                 maximum_size: int,
                 alpha: float = 0.0,
                 beta_annealing_schedule: BetaAnnealingSchedule = None,
                 initial_beta: float = 0.0,
                 random_state: RandomState = None) -> None:
        """
        Initialize a PrioritizedExperienceReplayBuffer object.

        Parameters:
        -----------
        maximum_size (int): maximum size of buffer
        alpha (float): Strength of prioritized sampling. Default to 0.0 (i.e., uniform sampling).
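        beta_annealing_schedule (BetaAnnealingSchedule): function that takes an episode number and 
            an initial value for beta and returns the current value of beta.
        initial_beta (float): Initial strength of the sampling bias correction. Default to 0.0 
            (i.e., no correction).
        random_state (np.random.RandomState): random number generator.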
        
        """
        self._maximum_size = maximum_size
        self._current_size = 0 # current number of prioritized experience tuples in buffer
        _dtype = [("priority", np.float32), ("experience", Experience)]
        self._buffer = np.empty(self._maximum_size, _dtype)
        self._alpha = alpha
        self._initial_beta = initial_beta
        
        if beta_annealing_schedule is None:
            self._beta_annealing_schedule = lambda n: self._initial_beta
        else:
            self._beta_annealing_schedule = lambda n: beta_annealing_schedule(n, self._initial_beta)

        self._random_state = np.random.RandomState() if random_state is None else random_state
        
    def __len__(self) -> int:
        """Current number of prioritized experience tuple stored in buffer."""
        return self._current_size

    @property
    def alpha(self):
        """Strength of prioritized sampling."""
        return self._alpha
    
    @property
    def maximum_size(self) -> int:
        """Maximum number of prioritized experience tuples stored in buffer."""
        return self._maximum_size
    
    @property
    def initial_beta(self):
        """Initial strength for sampling correction."""
        return self._initial_beta

    def add(self, experience: Experience) -> None:
        """Add a new experience to memory."""
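        # new experiences get the largest priority stored so far (1.0 if the buffer is empty);
        # only the filled part of the buffer is inspected because np.empty leaves the rest uninitialized
        priority = 1.0 if self.is_empty() else self._buffer[:self._current_size]["priority"].max()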
        if self.is_full():
            if priority > self._buffer["priority"].min():
                idx = self._buffer["priority"].argmin()
                self._buffer[idx] = (priority, experience)
            else:
                pass # low priority experiences should not be included in buffer
        else:
            self._buffer[self._current_size] = (priority, experience)
            self._current_size += 1

    def is_empty(self) -> bool:
        """True if the buffer is empty; False otherwise."""
        return self._current_size == 0
    
    def is_full(self) -> bool:
        """True if the buffer is full; False otherwise."""
        return self._current_size == self._maximum_size
    
    def sample(self, batch_size: int, episode_number: int) -> typing.Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """Sample a batch of experiences from memory."""
        # use sampling scheme to determine which experiences to use for learning
        ps = self._buffer[:self._current_size]["priority"]
        sampling_probs = sampling_probabilities(ps, self._alpha)
        
        # use sampling probabilities to compute sampling weights
        beta = self._beta_annealing_schedule(episode_number)
        weights = sampling_weights(sampling_probs, beta, normalize=True)
        
        # randomly sample indicies corresponding to priority, experience tuples
        idxs = np.arange(sampling_probs.size)
        random_idxs = self._random_state.choice(idxs,
                                                size=batch_size,
                                                replace=True,
                                                p=sampling_probs)
        
        # select the experiences and sampling weights
        sampled_experiences = self._buffer["experience"][random_idxs]
        sampled_weights = weights[random_idxs]
        
        return random_idxs, sampled_experiences, sampled_weights

    def update_priorities(self, idxs: np.ndarray, priorities: np.ndarray) -> None:
        """Update the priorities associated with particular experiences."""
        self._buffer["priority"][idxs] = priorities
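
The buffer expects a `beta_annealing_schedule` that maps an episode number and an initial value of beta to the current value of beta. One simple possibility, sketched below, is to anneal beta linearly towards 1, which is the kind of schedule Schaul et al 2016 describe. The function name and the `number_episodes` parameter are my own assumptions.

```python
def linear_annealing_schedule(episode_number: int,
                              initial_beta: float,
                              number_episodes: int = 1000) -> float:
    """Increase beta linearly from initial_beta to 1.0 over number_episodes episodes."""
    # fraction of the annealing period that has elapsed, capped at 1.0
    fraction = min(episode_number / number_episodes, 1.0)
    return initial_beta + fraction * (1.0 - initial_beta)


start = linear_annealing_schedule(0, initial_beta=0.4)       # no annealing yet
halfway = linear_annealing_schedule(500, initial_beta=0.4)   # half-way between 0.4 and 1.0
finished = linear_annealing_schedule(2000, initial_beta=0.4) # capped at full correction
```

Because `number_episodes` has a default value, the function already has the two-argument form that the buffer calls.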


### The Loss Function

The $Q$-learning update at iteration $i$ uses the following loss function

$$ \mathcal{L}_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \Bigg[\bigg(r + \gamma \max_{a'} Q\big(s', a'; \theta_i^{-}\big) - Q\big(s, a; \theta_i\big)\bigg)^2\Bigg] $$

where $\gamma$ is the discount factor determining the agent’s horizon, $\theta_i$ are the parameters of the $Q$-network at iteration $i$ and $\theta_i^{-}$ are the $Q$-network parameters used to compute the target at iteration $i$. The target network parameters $\theta_i^{-}$ are only updated with the $Q$-network parameters $\theta_i$ every $C$ steps and are held fixed between individual updates. 
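
The periodic update of $\theta_i^{-}$ can be illustrated without any neural network at all. In the toy sketch below a dictionary stands in for the network parameters and adding 1.0 stands in for a gradient descent update; the names and the value of `C` are hypothetical.

```python
import copy
import typing

Parameters = typing.Dict[str, float]


def synchronize(target: Parameters, online: Parameters) -> None:
    """Hard update: overwrite the target parameters with a copy of the online parameters."""
    target.clear()
    target.update(copy.deepcopy(online))


online_parameters: Parameters = {"w": 0.0}
target_parameters: Parameters = copy.deepcopy(online_parameters)
C = 4  # hypothetical synchronization period

history = []
for step in range(1, 9):
    online_parameters["w"] += 1.0  # stand-in for one gradient descent update
    if step % C == 0:
        synchronize(target_parameters, online_parameters)
    history.append(target_parameters["w"])

print(history)  # [0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0, 8.0]
```

The target parameters stay fixed between synchronizations, so the regression targets only move once every `C` updates.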

In [None]:
Loss = torch.Tensor


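def mse_loss(deltas: TDErrors, sampling_weights: torch.Tensor) -> Loss:
    """Compute the (weighted) mean squared loss."""
    # each squared TD-error is scaled by its importance sampling weight, so the gradient of
    # sample i is proportional to w_i * delta_i as in the update rule of Schaul et al 2016
    loss = torch.mean(sampling_weights * deltas**2)
    return loss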


### The Deep Q-Network Algorithm

The following is Python pseudo-code for the Deep Q-Network (DQN) algorithm. For more fine-grained details of the DQN algorithm see the methods section of [Mnih et al 2015](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf).

```python

# hyper-parameters
batch_size = 32 # number of experience tuples used in computing the gradient descent parameter update.
buffer_size = 10000 # number of experience tuples stored in the replay buffer
gamma = 0.99 # discount factor used in the Q-learning update
target_network_update_frequency = 4 # frequency (measured in parameter updates) with which target network is updated.
update_frequency = 4 # frequency (measured in number of timesteps) with which q-network parameters are updated.

# initializing the various data structures
replay_buffer = ExperienceReplayBuffer(batch_size, buffer_size, seed)
local_q_network = initialize_q_network()
target_q_network = initialize_q_network()
synchronize_q_networks(target_q_network, local_q_network)

for i in range(number_episodes):

    # initialize the environment state
    state = env.reset()

    # simulate a single training episode
    done = False
    timesteps = 0
    parameter_updates = 0
    while not done:

        # epsilon-greedy action based on Q(s, a; theta)
        action = agent.choose_epsilon_greedy_action(state) 

        # update the environment based on the chosen action
        next_state, reward, done = env.step(action)

        # agent records experience in its replay buffer
        experience = (state, action, reward, next_state, done)
        agent.replay_buffer.append(experience)

        # agent samples a mini-batch of experiences from its replay buffer
        experiences = agent.replay_buffer.sample()
        states, actions, rewards, next_states, dones = experiences

        # agent learns every update_frequency timesteps
        if timesteps % update_frequency == 0:
            
            # compute the Q^- values using the Q-learning formula
            target_q_values = q_learning_update(target_q_network, rewards, next_states, dones)

            # compute the Q values
            local_q_values = local_q_network(states, actions)

            # agent updates the parameters theta using gradient descent
            loss = mean_squared_error(target_q_values, local_q_values)
            gradient_descent_update(loss)
            
            parameter_updates += 1

        # every target_network_update_frequency parameter updates set theta^- = theta
        if parameter_updates % target_network_update_frequency == 0:
            synchronize_q_networks(target_q_network, local_q_network)

        # move on to the next state and count the timestep
        state = next_state
        timesteps += 1
```
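
The pseudo-code relies on an epsilon-greedy rule for choosing actions. A minimal sketch of such a rule is shown below: it operates on a plain list of Q-values for the current state rather than on a Q-network, and both the function signature and the example values are my own assumptions.

```python
import random
import typing


def choose_epsilon_greedy_action(q_values: typing.Sequence[float],
                                 epsilon: float,
                                 rng: random.Random) -> int:
    """With probability epsilon explore uniformly at random; otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
q_values = [0.1, 0.7, 0.2]  # made-up Q-values for a single state
greedy = choose_epsilon_greedy_action(q_values, epsilon=0.0, rng=rng)   # always action 1
explore = choose_epsilon_greedy_action(q_values, epsilon=1.0, rng=rng)  # uniformly random action
```

In practice epsilon is usually decayed over the course of training so that the agent explores less as its Q-value estimates improve.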


#### Plotting the time series of scores

I can use [Pandas](https://pandas.pydata.org/) to quickly plot the time series of scores along with a 100 episode moving average.

#### Kernel density plot of the scores

## Ideas for future work?

I have a few ideas for future work. Perhaps the most straightforward thing to do would be to 
continue following the literature and the algorithmic improvements from 
[*Dueling Network Architectures for Deep Reinforcement Learning*](https://arxiv.org/abs/1511.06581),
[*Noisy Networks for Exploration*](https://arxiv.org/abs/1706.10295), and perhaps 
[*Rainbow: Combining Improvements in Deep Reinforcement Learning*](https://arxiv.org/abs/1710.02298). 

However, I am interested in reinforcement learning directly from pixel data where the action space 
is continuous. To tackle these more challenging tasks it might make sense to jump ahead and work on 
implementing more recent approaches such as 
[*Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor*](https://arxiv.org/abs/1801.01290) and 
[*Image Augmentation is All You Need: Regularizing Deep Reinforcement Learning from Pixels*](https://arxiv.org/pdf/2004.13649.pdf).