# Deep Q-networks (DQN)

## Table of contents

1. [Understanding Deep Q-networks (DQN)](#understanding-deep-q-networks-dqn)
2. [Setting up the environment](#setting-up-the-environment)
3. [Defining the environment for training](#defining-the-environment-for-training)
4. [Building the DQN architecture](#building-the-dqn-architecture)
5. [Implementing the experience replay buffer](#implementing-the-experience-replay-buffer)
6. [Implementing the action selection policy](#implementing-the-action-selection-policy)
7. [Training the DQN agent](#training-the-dqn-agent)
8. [Evaluating the agent's performance](#evaluating-the-agents-performance)
9. [Experimenting with hyperparameters](#experimenting-with-hyperparameters)

## Understanding Deep Q-networks (DQN)

Deep Q-Networks (DQN) are a type of reinforcement learning (RL) algorithm that enhances the classic Q-learning approach by incorporating deep neural networks. DQN enables an agent to learn optimal actions in complex environments by approximating the Q-value function with a neural network. This is particularly useful when dealing with high-dimensional state spaces, such as video games, robotic control, or other decision-making tasks.

### **Key concepts in reinforcement learning**

Reinforcement learning revolves around an agent that interacts with an environment and makes decisions over time to maximize a cumulative reward. Key concepts include:
- **State**: Represents the environment at a given time.
- **Action**: The decision made by the agent at each step based on the current state.
- **Reward**: The feedback received from the environment after the agent takes an action.
- **Policy**: The strategy that defines the agent's behavior, mapping states to actions.
- **Q-value**: The expected cumulative reward for taking a specific action in a given state and following the optimal policy afterward.

### **Q-learning: The foundation of DQN**

Q-learning is a reinforcement learning algorithm that aims to learn the optimal **Q-value function**, which estimates the expected cumulative reward for each action in a given state. It works by updating the Q-values iteratively based on the reward received and the maximum predicted future reward. However, traditional Q-learning uses a table to store Q-values for each state-action pair, which becomes infeasible for environments with large or continuous state spaces.

### **Deep Q-Network (DQN)**

DQN extends Q-learning by approximating the Q-value function using a **deep neural network** instead of a table. The network takes the state as input and outputs Q-values for each possible action. This approach makes it possible to apply Q-learning in environments with high-dimensional or continuous state spaces.

The objective of DQN is to minimize the error between the predicted Q-values and the target Q-values, which are based on the agent's experience. The agent learns to make decisions by repeatedly interacting with the environment, storing experiences, and updating the network based on those experiences.

### **Key innovations in DQN**

To stabilize and improve the training process, DQN introduces several key techniques:

#### **Experience replay**

In reinforcement learning, consecutive experiences are often highly correlated, which can destabilize the training of a neural network. DQN addresses this issue by using **experience replay**, a mechanism that stores experiences (state, action, reward, next state) in a buffer. The agent samples random batches of experiences from the buffer to update the Q-network. This reduces the correlation between experiences and provides the agent with a more diverse set of training examples.
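As a rough illustration of the mechanism described above, the following is a minimal sketch of a replay buffer using only the Python standard library. The names `Transition` and `ReplayBuffer` are hypothetical, and the extra `done` field (marking the end of an episode) is an assumption that is not part of the tuple listed above, although it is commonly stored alongside it.

```python
import random
from collections import deque, namedtuple

# One stored experience. `done` is an added assumption (episode-termination flag).
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-capacity buffer that discards the oldest transitions first."""

    def __init__(self, capacity):
        # A deque with maxlen drops items from the left once it is full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement decorrelates consecutive steps.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = buffer.sample(batch_size=2)
```

With a capacity of 3, pushing five transitions leaves only the three most recent ones in the buffer, which is also one simple way to bound memory use during long training runs.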

#### **Target network**

DQN uses a **target network** to stabilize learning. The target network is a copy of the Q-network whose weights are updated less frequently. Because the target Q-values are computed from this slowly changing copy rather than from the network being trained, the regression targets do not shift after every gradient step, which reduces the risk of the Q-values oscillating or diverging during training. The Q-network is still used for action selection; the target network is used only for target computation.

#### **Exploration-exploitation tradeoff**

DQN handles the **exploration-exploitation tradeoff** using an epsilon-greedy policy. At each step, the agent either explores the environment by taking a random action (with probability epsilon) or exploits its current knowledge by selecting the action with the highest predicted Q-value (with probability 1 - epsilon). Epsilon typically decays over time, starting with a high value to encourage exploration and gradually decreasing as the agent becomes more knowledgeable about the environment.
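The policy described above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the exponential decay schedule, and the default values (`eps_start`, `eps_end`, `decay_steps`) are assumptions chosen for the example, and other schedules such as linear decay are equally common.

```python
import math
import random


def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Exponentially anneal epsilon from eps_start toward eps_end."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / decay_steps)


def select_action(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest predicted Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.7, 0.3]  # hypothetical Q-values for three actions
greedy_action = select_action(q_values, epsilon=0.0)  # never explores
```

Setting `epsilon=0.0` turns the same function into a purely greedy policy, which is the usual choice when evaluating a trained agent.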

### **Training process**

The training of a DQN agent follows these general steps:
- The agent interacts with the environment, observing the current state and selecting an action based on the epsilon-greedy policy.
- The action is executed, and the agent receives a reward and observes the next state.
- The experience (state, action, reward, next state) is stored in the replay buffer.
- Random samples are drawn from the replay buffer to update the Q-network by minimizing the error between predicted Q-values and target Q-values.
- Periodically, the target network is updated to match the weights of the Q-network.

This process continues over multiple episodes, where the agent learns to improve its policy by iteratively refining its Q-value estimates.
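The five steps above can be put together in a compact PyTorch sketch. Everything environment-specific here is an assumption made for illustration: `Corridor` is a hypothetical toy environment (not a Gym environment), the network size and hyperparameters are arbitrary, and hitting the step limit is treated as a terminal state for simplicity.

```python
import random
from collections import deque

import torch
import torch.nn as nn


class Corridor:
    """Hypothetical toy task: start at cell 0, reach the last cell for reward 1."""

    def __init__(self, length=5, max_steps=20):
        self.length, self.max_steps = length, max_steps

    def _obs(self):
        obs = [0.0] * self.length
        obs[self.pos] = 1.0  # one-hot encoding of the position
        return obs

    def reset(self):
        self.pos, self.t = 0, 0
        return self._obs()

    def step(self, action):  # 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + move))
        self.t += 1
        reached = self.pos == self.length - 1
        done = reached or self.t >= self.max_steps
        return self._obs(), (1.0 if reached else 0.0), done


def make_net(state_dim, num_actions):
    return nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, num_actions))


def train(episodes=30, gamma=0.99, batch_size=16, sync_every=20, seed=0):
    random.seed(seed)
    torch.manual_seed(seed)
    env = Corridor()
    q_net, target_net = make_net(env.length, 2), make_net(env.length, 2)
    target_net.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-2)
    buffer, returns, step_count = deque(maxlen=1000), [], 0

    for episode in range(episodes):
        state, done, total = env.reset(), False, 0.0
        epsilon = max(0.05, 1.0 - episode / episodes)  # linear decay
        while not done:
            # 1. Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.randrange(2)
            else:
                with torch.no_grad():
                    action = int(q_net(torch.tensor([state])).argmax(dim=1).item())
            # 2. Execute the action, observe reward and next state.
            next_state, reward, done = env.step(action)
            # 3. Store the transition in the replay buffer.
            buffer.append((state, action, reward, next_state, float(done)))
            state, total, step_count = next_state, total + reward, step_count + 1
            # 4. Sample a minibatch and take one gradient step on the TD loss.
            if len(buffer) >= batch_size:
                s, a, r, s2, d = zip(*random.sample(buffer, batch_size))
                s, s2 = torch.tensor(s), torch.tensor(s2)
                a, r, d = torch.tensor(a), torch.tensor(r), torch.tensor(d)
                q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    y = r + gamma * (1.0 - d) * target_net(s2).max(dim=1).values
                loss = nn.functional.mse_loss(q_sa, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # 5. Periodically copy the Q-network weights into the target network.
            if step_count % sync_every == 0:
                target_net.load_state_dict(q_net.state_dict())
        returns.append(total)
    return returns


episode_returns = train(episodes=10)
```

The returned list holds one total reward per episode, so it can be plotted directly to inspect learning progress. The sketch is meant to show the order of the steps rather than a tuned agent.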

### **Extensions of DQN**

DQN has been enhanced in several ways to improve its performance and address limitations:

- **Double DQN (DDQN)**: This variant of DQN reduces the overestimation bias that can occur when Q-values are updated. Double DQN separates the action selection and action evaluation steps, which improves the accuracy of the Q-value estimates.
- **Dueling DQN**: Dueling DQN separates the estimation of the value of being in a certain state (the value function) from the estimation of the advantage of taking a particular action (the advantage function). These two components are combined to compute the final Q-values, allowing the network to learn more efficiently, especially in environments where many actions may have similar outcomes.
- **Prioritized Experience Replay**: This enhancement to experience replay prioritizes experiences that are more informative or have a larger impact on learning. Experiences with higher priority (based on their TD-error) are sampled more frequently, allowing the agent to focus on more important experiences.

### **Applications of DQN**

DQN has been applied successfully in many areas, especially in environments where the state space is large or complex:
- **Video games**: DQN achieved notable success when it was applied to Atari games, where it learned to play directly from raw pixel inputs, achieving superhuman performance in several games.
- **Robotics**: DQN is used in robotic control tasks where the robot must learn to perform complex movements or manipulate objects in dynamic environments.
- **Autonomous systems**: DQN is also applied in autonomous driving, drones, and other systems where the agent must make decisions in real-time based on sensory input from the environment.

### **Challenges in DQN**

Despite its success, DQN has some limitations:
- **Training instability**: Training a DQN can be unstable, especially in environments with large action spaces or where rewards are sparse. Techniques like experience replay and target networks help, but additional strategies like reward shaping or curriculum learning are sometimes needed.
- **Sample inefficiency**: DQN requires a large number of interactions with the environment to converge to an optimal policy, which can be computationally expensive. Enhancements like prioritized experience replay help address this issue by improving sample efficiency.

### **Maths**

#### **Q-learning: The foundation of DQN**

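In Q-learning, the goal is to learn the **Q-value function**, which estimates the expected cumulative discounted reward for taking action $ a $ in state $ s $ and following the optimal policy afterward. The optimal Q-value function satisfies the **Bellman equation**, which Q-learning uses as the target of its iterative updates:

$$
Q(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q(s', a') \right]
$$

Where:
- $ r $ is the immediate reward received after taking action $ a $ in state $ s $,
- $ \gamma $ is the discount factor ($ 0 \le \gamma \le 1 $) that determines the importance of future rewards,
- $ s' $ is the next state,
- $ a' $ ranges over the actions available in the next state $ s' $, and the $ \max $ picks the highest Q-value among them,
- the expectation is taken over the randomness of the environment in $ r $ and $ s' $; in a deterministic environment it can be dropped.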

The Q-value function $ Q(s, a) $ represents the expected cumulative discounted reward of taking action $ a $ in state $ s $, followed by the best possible actions in subsequent states. The goal of Q-learning is to find the optimal Q-values, which allow the agent to act optimally by choosing the action with the highest Q-value in each state.

#### **DQN: Q-learning with a neural network**

Deep Q-Networks (DQN) extend Q-learning by using a neural network to approximate the Q-value function, rather than storing Q-values in a table. The network takes the state $ s $ as input and outputs the Q-values for all possible actions. The network is trained by minimizing the difference between the predicted Q-values and the target Q-values (computed using the Bellman equation).

Let the Q-network be parameterized by $ \theta $. The predicted Q-value for action $ a $ in state $ s $ is $ Q(s, a; \theta) $. The target Q-value is computed as:

$$
y = r + \gamma \max_{a'} Q(s', a'; \theta^-)
$$

Where:
- $ r $ is the reward received after taking action $ a $,
- $ s' $ is the next state,
- $ \gamma $ is the discount factor,
- $ Q(s', a'; \theta^-) $ is the target Q-value computed using a separate **target network** with parameters $ \theta^- $, which are periodically updated from the Q-network's parameters.

The loss function for updating the Q-network is the **mean squared error** between the predicted and target Q-values:

$$
L(\theta) = \mathbb{E}_{(s, a, r, s')} \left[ \left( y - Q(s, a; \theta) \right)^2 \right]
$$

Where $ (s, a, r, s') $ represents a sample from the agent’s experience, and the expectation is over the replay buffer.

The gradient of the loss function with respect to the network parameters $ \theta $ is used to update the network weights via gradient descent.
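To make the target and the loss concrete, here is a minimal PyTorch sketch with a hand-checkable example. The function name `dqn_loss` is hypothetical, and the `dones` mask is an added assumption: for a terminal transition the bootstrap term is dropped so that $ y = r $. The two "networks" below are simple lookup tables standing in for real `nn.Module` instances.

```python
import torch
import torch.nn.functional as F


def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Mean squared TD error over a batch of transitions."""
    # Q(s, a; theta): keep only the prediction for the action actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target y is treated as a constant
        max_next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next_q
    return F.mse_loss(q_sa, y)


# Lookup table with 2 states (rows) and 2 actions (columns).
q_table = torch.tensor([[1.0, 2.0], [0.0, 4.0]])


def q_net(s):
    return q_table[s]


def target_net(s):  # identical values, as right after a target sync
    return q_table[s]


loss = dqn_loss(
    q_net, target_net,
    states=torch.tensor([0]), actions=torch.tensor([1]),
    rewards=torch.tensor([1.0]), next_states=torch.tensor([1]),
    dones=torch.tensor([0.0]), gamma=0.5,
)
```

Here the prediction is $ Q(0, 1) = 2 $, the target is $ y = 1 + 0.5 \cdot \max(0, 4) = 3 $, and the loss is therefore $ (3 - 2)^2 = 1 $.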

#### **Experience replay**

Experience replay is used to store the agent’s experiences $ (s, a, r, s') $ in a buffer, which helps break the temporal correlation between consecutive experiences. At each training step, a random minibatch of experiences is sampled from the buffer and used to compute the loss function. This stabilizes training by reducing variance in the Q-value estimates.

Mathematically, the experience replay buffer $ D $ contains a set of transitions $ (s, a, r, s') $. At each update, a minibatch $ B \subset D $ is sampled uniformly at random, and the loss is the average over that minibatch rather than over the whole buffer:

$$
L(\theta) = \frac{1}{|B|} \sum_{(s, a, r, s') \in B} \left( y - Q(s, a; \theta) \right)^2
$$

#### **Target network**

The **target network** is a key innovation in DQN that stabilizes learning. While the Q-network $ Q(s, a; \theta) $ is updated frequently, the target network $ Q(s, a; \theta^-) $ is updated less often. The target network is used to compute the target Q-values during training, and its parameters $ \theta^- $ are updated periodically by copying the parameters of the Q-network $ \theta $. This decoupling reduces the risk of instability or divergence during training.
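The periodic copy can be illustrated with a short PyTorch sketch. The single linear layer is only a stand-in for a real Q-network, and `sync_target` is a hypothetical helper name; the copy itself relies on the standard `state_dict` and `load_state_dict` methods of `nn.Module`.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
q_net = nn.Linear(4, 2)            # stand-in for the Q-network (theta)
target_net = copy.deepcopy(q_net)  # theta^- starts out equal to theta
for p in target_net.parameters():
    p.requires_grad_(False)        # the target network is never trained directly


def sync_target(q_net, target_net):
    """Hard update: overwrite theta^- with the current theta."""
    target_net.load_state_dict(q_net.state_dict())


# Simulate a training step that changes theta only.
with torch.no_grad():
    q_net.weight.add_(1.0)

stale = not torch.equal(q_net.weight, target_net.weight)  # target lags behind
sync_target(q_net, target_net)
synced = torch.equal(q_net.weight, target_net.weight)
```

Between two syncs the target network keeps its old weights, so the targets stay fixed while the Q-network moves.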

#### **Bellman equation and the Q-update**

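At the heart of DQN is the **Bellman equation**, which provides a recursive relationship between the Q-values of consecutive states. The update target combines the immediate reward $ r $ with the discounted maximum future Q-value, so that the Q-values capture both the immediate and the future rewards expected from following the optimal policy. In tabular Q-learning, the current estimate is moved toward this target by a step of size $ \alpha $. DQN applies the same idea, with the maximum computed by the target network, $ \max_{a'} Q(s', a'; \theta^-) $, and the step taken by gradient descent on the loss. The tabular form of the update is: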

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)
$$

Where:
- $ \alpha $ is the learning rate,
- $ \gamma $ is the discount factor.

This update rule drives the Q-values toward better approximations of the true expected cumulative reward for each state-action pair.
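A single application of this update rule can be traced by hand. The sketch below uses a nested list as the Q-table; the function name and the `done` argument (which drops the future term at the end of an episode) are assumptions made for the example.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    best_next = 0.0 if done else max(Q[s_next])
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Q[state][action] for 2 states and 2 actions.
Q = [[0.0, 0.0], [1.0, 2.0]]
td = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
```

With these numbers the target is $ 1 + 0.5 \cdot 2 = 2 $, the TD error is $ 2 - 0 = 2 $, and the entry `Q[0][1]` moves halfway toward the target, from 0 to 1.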

#### **Double DQN (DDQN)**

Double DQN (DDQN) addresses the issue of **overestimation** in Q-learning by separating the action selection and evaluation steps. In DDQN, the Q-network is used to select the action that maximizes the Q-value in the next state $ s' $, while the target network is used to evaluate the Q-value of that action. This reduces overestimation and leads to more accurate Q-value estimates.

The target Q-value in Double DQN is computed as:

$$
y = r + \gamma Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-)
$$
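The difference between the two targets is easiest to see on a small example. The sketch below uses plain Python lists in place of network outputs; the function names and the Q-values are made up for illustration.

```python
def dqn_target(r, gamma, target_q_next):
    """Standard DQN: the target network both selects and evaluates the action."""
    return r + gamma * max(target_q_next)


def double_dqn_target(r, gamma, online_q_next, target_q_next):
    """Double DQN: the Q-network selects, the target network evaluates."""
    best_action = max(range(len(online_q_next)), key=lambda a: online_q_next[a])
    return r + gamma * target_q_next[best_action]


online_q_next = [1.0, 3.0, 2.0]  # Q(s', . ; theta)
target_q_next = [4.0, 2.0, 1.0]  # Q(s', . ; theta^-)

y_dqn = dqn_target(r=1.0, gamma=0.5, target_q_next=target_q_next)
y_ddqn = double_dqn_target(1.0, 0.5, online_q_next, target_q_next)
```

Standard DQN takes the largest target-network value, giving $ 1 + 0.5 \cdot 4 = 3 $. Double DQN first lets the Q-network pick action 1 and then reads the target network's value for that action, giving $ 1 + 0.5 \cdot 2 = 2 $. The Double DQN target can never exceed the standard one, which is how it counteracts overestimation.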

#### **Dueling DQN**

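Dueling DQN improves Q-value estimation by decomposing the Q-value function into two separate components: the **value function** and the **advantage function**. The value function estimates how good it is to be in a particular state, regardless of the action taken, while the advantage function estimates how much better or worse each action is relative to the others in that state. Subtracting the mean advantage makes the decomposition identifiable: without it, a constant could be shifted between $ V(s) $ and $ A(s, a) $ without changing $ Q(s, a) $. The two components are combined to produce the Q-value: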

$$
Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)
$$

Where:
- $ V(s) $ is the value of being in state $ s $,
- $ A(s, a) $ is the advantage of taking action $ a $ in state $ s $,
- $ \mathcal{A} $ is the set of possible actions.

This decomposition helps the network learn state values and action advantages more efficiently, especially in environments where some actions have little effect on the outcome.
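A dueling head is straightforward to express in PyTorch. The following is a minimal sketch rather than a reference architecture: the class name `DuelingQNetwork`, the single shared hidden layer, and the layer sizes are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn


class DuelingQNetwork(nn.Module):
    """Shared feature layer followed by separate value and advantage heads."""

    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)                 # V(s)
        self.advantage_head = nn.Linear(hidden, num_actions)   # A(s, a)

    def forward(self, state):
        h = self.features(state)
        value = self.value_head(h)          # shape: (batch, 1)
        advantage = self.advantage_head(h)  # shape: (batch, num_actions)
        # Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))
        return value + advantage - advantage.mean(dim=1, keepdim=True)


torch.manual_seed(0)
net = DuelingQNetwork(state_dim=4, num_actions=3)
states = torch.randn(5, 4)   # a batch of 5 made-up states
q_values = net(states)       # shape: (5, 3)
```

Because the mean advantage is subtracted, averaging the output Q-values over actions recovers $ V(s) $ for every state in the batch.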

## Setting up the environment


##### **Q1: How do you install the necessary libraries for building and training a DQN in PyTorch?**


##### **Q2: How do you import the required modules for building the DQN architecture and handling the environment in PyTorch?**


##### **Q3: How do you configure your environment to use GPU support for training the DQN in PyTorch?**

## Defining the environment for training


##### **Q4: How do you load an environment from OpenAI Gym for training a DQN agent?**


##### **Q5: How do you retrieve the state and action space from the Gym environment to define the DQN input and output?**


##### **Q6: How do you reset the environment in Gym and retrieve the initial state for training the agent?**

## Building the DQN architecture


##### **Q7: How do you define the architecture of the Q-network using `torch.nn.Module` in PyTorch?**


##### **Q8: How do you implement the forward pass of the DQN to predict Q-values given a state?**


##### **Q9: How do you initialize the weights of the Q-network to ensure stable training?**

## Implementing the experience replay buffer


##### **Q10: How do you create an experience replay buffer to store state transitions (state, action, reward, next state)?**


##### **Q11: How do you implement a method to add new transitions to the experience replay buffer?**


##### **Q12: How do you sample mini-batches of experiences from the replay buffer to train the DQN?**


##### **Q13: How do you limit the size of the replay buffer to prevent memory overflow during training?**

## Implementing the action selection policy


##### **Q14: How do you implement an epsilon-greedy policy for selecting actions based on Q-values predicted by the DQN?**


##### **Q15: How do you decay the epsilon value over time to gradually shift from exploration to exploitation?**


##### **Q16: How do you select an action using the epsilon-greedy policy during training and switch to a greedy policy during evaluation?**

## Training the DQN agent


##### **Q17: How do you implement the training loop for the DQN, including resetting the environment and selecting actions using the epsilon-greedy policy?**


##### **Q18: How do you store transitions in the experience replay buffer after each interaction with the environment?**


##### **Q19: How do you compute the target Q-values using the Bellman equation for updating the DQN?**


##### **Q20: How do you perform backpropagation and update the Q-network's weights using the loss between target and predicted Q-values?**


##### **Q21: How do you periodically copy the weights from the main Q-network to the target network to stabilize training?**

## Evaluating the agent's performance


##### **Q22: How do you evaluate the performance of the DQN agent on the Gym environment using a greedy policy (without exploration)?**


##### **Q23: How do you visualize the cumulative reward the agent accumulates over episodes during evaluation?**


##### **Q24: How do you save and reload the trained DQN model to evaluate it on new episodes without retraining?**

## Experimenting with hyperparameters


##### **Q25: How do you adjust the learning rate and observe its impact on the training stability and performance of the DQN agent?**


##### **Q26: How do you modify the discount factor (gamma) in the Bellman equation, and how does it affect the agent’s long-term reward optimization?**


##### **Q27: How do you experiment with different batch sizes for sampling experiences from the replay buffer to improve training efficiency?**


##### **Q28: How do you experiment with different architectures for the Q-network (e.g., adding more layers or changing activation functions) to improve the model's learning capacity?**


##### **Q29: How do you adjust the epsilon decay rate to control how quickly the agent shifts from exploration to exploitation?**


##### **Q30: How do you adjust the target network update frequency, and how does it affect the stability of training?**

## Conclusion