### Deep Q-Networks

### University of Virginia
### Reinforcement Learning
#### Last updated: February 10, 2025

---


### SOURCES 

- Reinforcement Learning, RS Sutton & AG Barto, 2nd edition. Chapter 9
- Mastering Reinforcement Learning with Python, Enes Bilgin. Chapter 6

### LEARNING OUTCOMES

- Explain the novel ideas behind Q-Networks
- Explain the new challenges brought with Q-Networks 
- Explain the purpose of experience replay and implement it in code
- Explain the purpose of a target network and implement it in code
- Explain the purpose of soft updating the target network

### CONCEPTS

- function approximation
- experience replay
- target networks
- loss function
- soft update of target network

---  

### I. Recap of Q-Learning 

Given state space $S$ and action space $A$, learn values $Q(S,A)$.

$Q(S,A)$ is the action value function.

Values are organized in an array called the *Q-table*.

*Q-Learning* is a method for building this table.

The optimal action value function is defined as: 

$Q_*(s,a) := \underset{\pi}{\operatorname{\max}} Q_{\pi}(s,a)$

The maximum is taken over the space of all policies $\pi$; a policy that attains it is an optimal policy $\pi_*$.

**Illustration of Q Table**

![q_table](./Q-Learning_Matrix_Initialized_and_After_Training.png)

### II. From Q-Learning to Deep Q-Networks

The table approach can be limiting: 
- space of states $S$, actions $A$ can grow large
- difficult to explore all combinations $S \times A$
- difficult to store all values

We make a large paradigm shift from computing a finite table of action values $Q(S,A)$ to using function approximation.

**Model Type** 

Many function types have been explored, including: 
- linear models
- polynomials
- Fourier basis functions
- radial basis functions
- artificial neural networks

Current state of the art uses deep neural networks (DNN) for approximating the action-value function.  
This is what we will present going forward.  
Denote neural network parameters as $\theta$ and the action-value function as $Q_{\theta}$

The graphic below illustrates a deep neural network.  
The input layer holds each element of the state vector *s*.  
The outputs are the action values for each action given the state.  
In this case, there are two possible actions from the state, $a_1$ and $a_2$.

The outputs are action values: $q(s,a_1)$ and $q(s,a_2)$. 

![dqn](./dqn3.png)
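
As a rough illustration of the mapping from a state vector to one action value per action, here is a minimal NumPy sketch of a forward pass. The layer sizes, the ReLU activation, the random untrained weights, and the names `theta` and `q_values` are all assumptions made for this sketch, not part of the network in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

state_dim, hidden_dim, num_actions = 3, 8, 2  # illustrative sizes only

# theta: the network parameters (randomly initialized here, not trained)
theta = {
    "W1": rng.normal(scale=0.1, size=(state_dim, hidden_dim)),
    "b1": np.zeros(hidden_dim),
    "W2": rng.normal(scale=0.1, size=(hidden_dim, num_actions)),
    "b2": np.zeros(num_actions),
}

def q_values(theta, s):
    """Return the vector [Q(s, a_1), ..., Q(s, a_k)] for state vector s."""
    h = np.maximum(0.0, s @ theta["W1"] + theta["b1"])  # hidden layer with ReLU
    return h @ theta["W2"] + theta["b2"]                # linear output, one value per action

s = np.array([0.5, -1.0, 2.0])         # hypothetical state vector
q = q_values(theta, s)
greedy_action = int(np.argmax(q))      # index of the action with the largest value
print(q, greedy_action)
```

A single forward pass yields the values of all actions at once, which is why the greedy action can be read off with one `argmax`.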

Using this approach brings strengths and weaknesses:

**Pros**

- can estimate values for unobserved states
- can estimate values for massive number of states

**Cons**

- the function cannot be exactly correct in all states
- DNNs require customizations to work in practice; without them, the models generally don't converge.
- the states will no longer be independent as they share parameters.  
This results in a tradeoff where approximations for some states get better while others get worse

**Accepting Error and Minimizing it**  

For the finite table (Q-learning), convergence to the correct answer was guaranteed.  
In the framework of function approximators - where **we can't get correct answers for all states** - we need a metric to optimize.  
We want to get "good enough" on average, where errors are averaged over a set of points.

**Improving the model**  

We will use *stochastic gradient descent* to improve the model iteratively, taking a small step in the direction of the negative gradient of the loss at each iteration.
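
To make the idea of a single improvement step concrete, here is a minimal sketch on a toy problem. It assumes a linear model and a squared-error loss on one data point; the function name `sgd_step` and the numbers are hypothetical.

```python
import numpy as np

def sgd_step(theta, x, y, lr=0.1):
    """One gradient descent step on the loss 0.5 * (theta . x - y)^2."""
    prediction = theta @ x
    grad = (prediction - y) * x   # gradient of the loss with respect to theta
    return theta - lr * grad      # step against the gradient to reduce the loss

theta = np.zeros(2)
x, y = np.array([1.0, 2.0]), 3.0

loss_before = 0.5 * (theta @ x - y) ** 2
theta = sgd_step(theta, x, y)
loss_after = 0.5 * (theta @ x - y) ** 2
print(loss_before, loss_after)
```

In this toy case the loss drops from 4.5 to 1.125 after one step. In a DQN the same principle applies, with the gradient computed by backpropagation through the network.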


---

### III. Experience Replay

One trick for helping with convergence is *experience replay*.  
Experience tuples or transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in a buffer and reused

The size of the buffer $M$ is a hyperparameter

The tuples can be sampled uniformly at random, which reduces correlation between samples

The number of samples, or batch size, is denoted $N$ and it is a hyperparameter.  
Note that $N$ can be smaller than $M$
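
A replay buffer along these lines can be sketched with a bounded `deque`. The class name `ReplayBuffer` and its method names are hypothetical, chosen for this illustration; $M$ and $N$ are set to tiny values so the behavior is easy to inspect.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # a deque with maxlen drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform sampling without replacement
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

M, N = 5, 3                      # buffer size and batch size
buffer = ReplayBuffer(capacity=M)
for t in range(8):               # add 8 transitions; only the last 5 are kept
    buffer.add(t, 0, 1.0, t + 1)

batch = buffer.sample(N)
print(len(buffer), batch)
```

Because sampling is uniform over the stored transitions, consecutive time steps are unlikely to appear together in a batch, which is the decorrelation effect described above.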

In more refined cases, weights can be assigned to each tuple to upweight their probability of selection.  
This is called *prioritized experience replay*.  
Higher priority can be given to certain $Q(s,a)$ with larger prediction errors.  
It can work well to prioritize based on TD errors.
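
One simple way to prioritize by TD error is to sample each transition with probability proportional to the magnitude of its error. The sketch below assumes this proportional scheme with a small constant `eps` added to each priority; the TD error values are made up, and refinements used in practice (such as correcting for the sampling bias) are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

td_errors = np.array([0.1, 2.0, 0.5, 0.05])  # hypothetical TD errors, one per stored transition
eps = 1e-3                                    # keeps every selection probability strictly positive

priorities = np.abs(td_errors) + eps
probs = priorities / priorities.sum()         # normalize into a probability distribution

# sample 2 distinct buffer indices, favoring transitions with large TD errors
idx = rng.choice(len(td_errors), size=2, replace=False, p=probs)
print(probs, idx)
```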

---  

**Reminder on How to Compute TD(0) Updates**

We will use TD(0) updating, so let's recall the form:

$Q(s,a) := Q(s,a) + \alpha [r + \gamma \underset{a'}{\operatorname{\max}} Q(s',a') -  Q(s,a)]$

The portion

$r + \gamma \underset{a'}{\operatorname{\max}} Q(s',a')$

comes from the new data.

When using deep RL, $Q(s',a')$ will be replaced by $Q_{\theta'}(s',a')$ which is based on the model.
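
For the tabular case, the update can be sketched as follows. The function name `td0_update`, the table size, and the numeric values are hypothetical and serve only as an illustration.

```python
import numpy as np

def td0_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one update to Q[s, a] in place and return the table."""
    target = r + gamma * np.max(Q[s_next])   # built from the newly observed r and s'
    Q[s, a] += alpha * (target - Q[s, a])    # move Q[s, a] a fraction alpha toward the target
    return Q

Q = np.zeros((3, 2))      # 3 states, 2 actions
Q[1] = [1.0, 2.0]         # pretend state 1 already has some learned values

td0_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])
```

Here the target is $1 + 0.9 \cdot 2 = 2.8$, so $Q(0,1)$ moves from 0 to approximately 0.28.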

---  

### IV. Target Networks

Supervised learning pairs features with fixed targets.

In deep reinforcement learning, the targets $Q_{\theta}$ are *q* values which update iteratively.

**This is a large complication and without adjustment, it can be hard for the algorithm to converge.** 

Essentially $Q_{\theta}$ is a moving target.

We maintain a second neural network which is a lagged copy of $Q_{\theta}$ and we call it $Q_{\theta'}$

$Q_{\theta'}$ is a *lagged neural network*

Let's put notation on the target.  
For non-terminal state,  

$y_j = r_j + \gamma \underset{a_j'}{\operatorname{\max}} Q_{\theta'}(s_j',a_j')$

For terminal state,  

$y_j = r_j$

where $y_j$ is the target value for transition $j$.  
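
The two cases can be handled together with a boolean mask. In the sketch below, `q_next` stands in for the output of the target network $Q_{\theta'}$ on each next state; the function name `compute_targets` and the array values are hypothetical.

```python
import numpy as np

def compute_targets(rewards, q_next, done, gamma=0.95):
    """
    rewards: shape (N,)
    q_next:  shape (N, num_actions), target-network values for each next state
    done:    shape (N,), True where the next state is terminal
    """
    max_q_next = q_next.max(axis=1)
    # (~done) zeroes out the bootstrap term for terminal transitions
    return rewards + gamma * max_q_next * (~done)

rewards = np.array([10.0, -100.0])
q_next = np.array([[1.0, 4.0], [7.0, 3.0]])
done = np.array([False, True])

y = compute_targets(rewards, q_next, done)
print(y)
```

The first target is $10 + 0.95 \cdot 4 = 13.8$; the second transition is terminal, so its target is just the reward, $-100$.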


**Updating the Lagged Copy of Network**

We use a hyperparameter $C$ to trigger a sync between $Q_{\theta'}$ and $Q_{\theta}$.  
$C$ can represent the number of time steps until $Q_{\theta'}$ is updated.

When triggered, the update is made: $Q_{\theta'} := Q_{\theta}$ 
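
A minimal sketch of this periodic sync, assuming the parameters are held in a dictionary of NumPy arrays. The increment standing in for a gradient update and the names `theta_target` and `sync_steps` are hypothetical.

```python
import copy
import numpy as np

theta = {"W": np.zeros((2, 2))}        # online network parameters
theta_target = copy.deepcopy(theta)    # lagged copy used for computing targets
C = 3                                  # sync period in time steps
sync_steps = []

for step in range(1, 8):
    theta["W"] += 1.0                  # stand-in for a gradient update of the online network
    if step % C == 0:
        theta_target = copy.deepcopy(theta)   # hard update: copy all parameters
        sync_steps.append(step)

print(sync_steps, theta["W"][0, 0], theta_target["W"][0, 0])
```

After 7 steps the target parameters still hold the values from step 6, the most recent sync. A deep copy is used so that later in-place changes to `theta` do not leak into `theta_target`.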

---

**Soft Update of Target Network**

The network update method just mentioned is binary: a full update happens at each step or it doesn't.  
It is a *hard update*.

An alternative is to do a *soft update* like this:

- At each time step, slowly update each element of $\theta'$ with a fraction of each element of $\theta$ 
- A smoothing parameter $\tau$ is used to apply the convex combination
- The value of $\tau$ will be small, such as 0.001

$\theta' := \tau \theta + (1-\tau) \theta'$ 

The soft update helps maintain stability in training, while updating the target network.  
This practice is used in some DQN implementations and other Deep RL methods.
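
The soft update is a one-line convex combination per parameter array. The sketch below assumes parameters stored in a dictionary of NumPy arrays, with the online parameters held fixed for illustration; the function name `soft_update` is hypothetical.

```python
import numpy as np

def soft_update(theta, theta_target, tau=0.001):
    """Return new target parameters: tau * theta + (1 - tau) * theta_target."""
    return {k: tau * theta[k] + (1.0 - tau) * theta_target[k] for k in theta}

theta = {"W": np.ones(3)}           # online parameters (kept fixed here for illustration)
theta_target = {"W": np.zeros(3)}   # target parameters start far from theta

for _ in range(1000):
    theta_target = soft_update(theta, theta_target)

print(theta_target["W"])
```

With $\tau = 0.001$, the target has closed only about 63% of the gap to $\theta$ after 1000 steps, which shows how slowly the lagged network tracks the online one.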

---

**Loss Function**

We need to define and use a loss function as the model won't be exact in each state.  
Define the loss as follows:

$L(\theta) = \mathop{\mathbb{E}}_{(s,a,r,s')\sim U(D)} \left[\left(r + \gamma \underset{a'}{\operatorname{\max}} Q_{\theta'}(s',a') -  Q_{\theta}(s,a) \right)^2 \right]$

As a reminder, $D$ is the replay buffer with size $M$
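
On a sampled minibatch, the expectation is estimated by a mean over the batch. The sketch below assumes the targets have already been computed; the function name `dqn_loss` and the array values are hypothetical.

```python
import numpy as np

def dqn_loss(q_online, actions, targets):
    """
    q_online: shape (N, num_actions), Q_theta(s, .) for each sampled state
    actions:  shape (N,), the action taken in each transition
    targets:  shape (N,), r + gamma * max_a' Q_theta'(s', a') for each transition
    """
    q_taken = q_online[np.arange(len(actions)), actions]  # Q_theta(s, a) for the action taken
    return np.mean((targets - q_taken) ** 2)

q_online = np.array([[1.0, 2.0], [0.5, 0.0]])
actions = np.array([1, 0])
targets = np.array([3.0, 0.5])

loss = dqn_loss(q_online, actions, targets)
print(loss)
```

Only the value of the action actually taken enters the loss. Here the errors are 1 and 0, giving a mean squared error of 0.5.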

---

### V. Implementing the Deep Q-Network

Now we put the pieces together. Here is the pseudocode, and later we will implement Python code.

1. Initialization
- Initialize $\theta$, the neural network parameters
- Set target network parameters $\theta' := \theta$. This is the neural network with lagged parameters.
- Initialize replay buffer $D$ with size $M$
- Set minimum batch size $N$ required for model update
- Set the hyperparameters for the neural network: number of hidden layers, neurons per hidden layer, ...
- Set all remaining hyperparameters: $\epsilon$, $\alpha$, $\gamma$, epochs, steps per epoch, ...

2. Set policy $\pi$ to be $\epsilon$-greedy with respect to $Q_\theta$
3. Given state $s$ and policy $\pi$ take action $a$
4. Observe reward $r$ and next state $s'$
5. Add transition $(s,a,r,s')$ to replay buffer $D$. If $|D|>M$, pop oldest transition.
6. If $|D|>N$, uniformly sample random minibatch of size $N$ transitions from $D$, else return to step 2.
7. Compute target values $y_j$ for each transition in minibatch.
8. Compute the loss $L(\theta)$ on the minibatch and take a gradient step to update $\theta$
9. Every $C$ time steps, make update $Q_{\theta'} := Q_{\theta}$, then return to step 2
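
Steps 2 and 3 rely on an $\epsilon$-greedy policy, which can be sketched as below. The function name `epsilon_greedy` and the action values are hypothetical; the vector `q` stands in for the output of $Q_\theta$ at the current state.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

q = np.array([0.1, 0.9, 0.3])   # hypothetical action values for one state

greedy = epsilon_greedy(q, epsilon=0.0)           # epsilon = 0 always exploits
counts = np.bincount([epsilon_greedy(q, 0.1) for _ in range(1000)], minlength=3)
print(greedy, counts)
```

With $\epsilon = 0.1$ the greedy action dominates the counts, while the other actions are still tried occasionally.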

---

### VI. Computational Example of DQN

This example uses a single network for both predictions and targets, which amounts to syncing $Q_{\theta'} := Q_{\theta}$ after each update, so $C=1$.  
In practice, we can set $C$ higher.


You might want to copy and run this code on [Google Colab](https://colab.research.google.com/?utm_source=scs-index)

In [None]:
import tensorflow as tf

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, LeakyReLU

from collections import deque
import math
import numpy as np
import os
import pandas as pd
import pdb
import random

In [1]:
batch_size = 10
epochs = 2
time_steps = 30

# states
sofa_levels = [0,1,2,3]
num_states = len(sofa_levels)
terminal_state = 3
state_size = 1 # dimensions of state space

# actions
vaso_dose = [0,1,2,3,4]
num_actions = len(vaso_dose) # number of possible actions

print('state_size:', state_size)
print('num_actions:', num_actions)

state_size: 1
num_actions: 5


In [1]:
# based on code sourced from: https://github.com/DrAPT/deep-q-learning/blob/master/dqn.py#L38
# TF2 + Keras

class DQN_Agent():
    def __init__(self, state_levels, state_size, action_size, verbose=False):
        
        self.state_levels  = state_levels
        self.state_size    = state_size
        self.action_size   = action_size
        self.verbose       = verbose
        
        self.memory_size   = 2000
        self.gamma         = 0.95   # discount rate
        self.epsilon       = 0.05  # exploration rate
        self.epsilon_decay = 0.995
        self.epsilon_min   = 0.01
        self.leaky_rate    = 0.01
        
        # model parameters
        self.learning_rate = 0.001
        self.hidden_1_size = 24
        self.hidden_2_size = 24
    
        self.memory = deque(maxlen=self.memory_size)
        
        self.model = self._build_dqn_model()

    def _build_dqn_model(self):
        model = Sequential()
        # hidden layer 1
        model.add(Dense(self.hidden_1_size, input_dim=self.state_size))
        model.add(BatchNormalization())
        model.add(LeakyReLU(alpha=self.leaky_rate))
        # hidden layer 2
        model.add(Dense(self.hidden_2_size))
        model.add(BatchNormalization())
        model.add(LeakyReLU(alpha=self.leaky_rate))
        # output layer
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer="adam")

        # initialize model by training on pairs: (state_level, random_qvalues_for_each_action) for each state_level
        for st in self.state_levels:
            model.fit(np.array([st]).reshape(1,1), np.random.random(self.action_size).reshape(1,self.action_size), verbose=0)
            
        return model
 
    def act(self, state):
        # epsilon-greedy selection of actions
        if np.random.rand() <= self.epsilon: # random draw with prob epsilon
            return random.randrange(self.action_size)
        # otherwise take the greedy action; the model expects input of shape (batch, state_size)
        act_values = self.model.predict(np.array([state]).reshape(1, self.state_size), verbose=0)
        return np.argmax(act_values[0])  # returns action
    
    def determine_next_state(self, state, action):
        '''
        return next state from the environment
        NOTE: in practice, this might use a model or historical tuples  
        '''
        if (state in [0,1,2]) and (action == 0): # no dose raises state
            next_state = min(terminal_state, state + 1)
        elif action in [3,4]: # higher doses lower state (floored at zero)
            next_state = max(0, state - 1)
        else:
            next_state = random.choice([1,2])
        return next_state
        
    def compute_reward(self, state):
        '''
        simple reward function for illustration. lower state value is better.
        '''
        
        if state == 3:
            reward = -100
        elif state == 2:
            reward = -10
        elif state == 1:
            reward = 0
        else:
            reward = 10
        return reward

        
    def memorize(self, state, action, reward, next_state, done):
        # cache transitions
        self.memory.append((state, action, reward, next_state, done))
    
    def replay(self, batch_size):
        print('==replaying')
        minibatch = random.sample(self.memory, batch_size)
        
        if self.verbose:
            print('minibatch size:', len(minibatch))
            print(minibatch, '\n')
        for state, action, reward, next_state, done in minibatch:
            # model inputs must have shape (batch, state_size)
            state_in = np.array([state]).reshape(1, self.state_size)
            next_state_in = np.array([next_state]).reshape(1, self.state_size)

            # target is reward if state is termination state
            target = reward
            if not done:
                # Q-learning target following the Bellman optimality equation:
                # reward + discounted max q-value of the next state (taking the best action in that state)
                target = reward + self.gamma * np.amax(self.model.predict(next_state_in, verbose=0))
            
            # target_f = current predicted q-values for the state, for each action (not just the best action)
            target_f = self.model.predict(state_in, verbose=0)
            
            # overwrite the q-value for the action taken, so only that output contributes to the loss
            target_f[0][action] = target
            
            # update the model
            self.model.fit(state_in, target_f, epochs=1, verbose=0)
            
            print('==predicted q-fcn, state 0:', self.model.predict(np.array([0]).reshape(1, 1), verbose=0))
            print('==predicted q-fcn, state 1:', self.model.predict(np.array([1]).reshape(1, 1), verbose=0))
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

**Main**  
Simulate trajectories (state, action, reward, next_state)

In [None]:
agent = DQN_Agent(sofa_levels, state_size, num_actions)

for ep in range(epochs):
    done = False
    state = random.choice([0,1,2]) # exploring starts

    for ts in range(time_steps):
        action = agent.act(state)
        next_state = agent.determine_next_state(state, action) 
        reward = agent.compute_reward(next_state)
        done = (next_state == terminal_state)
        agent.memorize(state, action, reward, next_state, done)
        print('epoch:', ep, ', time_step:', ts, ', state:', state, ', action:', action, ', reward:', reward, ', next_state:', next_state, ', done:', done)
        state = next_state
        if done: # epoch over
            break
        
        mem_size = len(agent.memory)
    
        if mem_size > batch_size:
            agent.replay(batch_size)
            
    #print('memory size:', mem_size)
    #print('memory:\n')
    #print(agent.memory)
            
    print('\n')

---

**Question 1**

From state 0, which action seems best?

**Question 2**

Does the Q function seem to converge? Note: DQN doesn't always converge; further methods have been developed to ameliorate this issue.

**Question 3 - Monte Carlo Simulation**

Now that you've trained a Q function, you will use it in a control problem.   
Specifically, write code to implement the following:

- simulate 10 episodes using 50 time steps each
- for each episode, begin in state 0
- use the policy to take the next step
- get the next state and reward
- terminate the episode if the next state = 3
- for each time step, print (episode, time_step, state, action, next_state, reward)
- compute the cumulative reward for each episode (without discounting)
- store and print the cumulative rewards across episodes, computing their min, max, mean
- store and print the cumulative medication dose across episodes, computing their min, max, mean

Do the results make sense?

---

**Question 4 - Multiple Objectives**

Suppose that the clinician wishes to meet these simultaneous objectives: 

- lower SOFA score is better 
- SOFA of 3 must be avoided
- lower total medication dosing is better

a) Explain how you could modify the RL problem to achieve these goals.


b) Run an MC simulation following the same procedure as Question 3.   
Based on the average cumulative reward and average cumulative dose, comment on the results.  
How does your approach perform?


---