### Q-Learning

### University of Virginia
### Reinforcement Learning
#### Last updated: November 3, 2023

---


### SOURCES 

- Reinforcement Learning, RS Sutton & AG Barto, 2nd edition. Chapter 6
- Mastering Reinforcement Learning with Python, Enes Bilgin. Chapter 5

### LEARNING OUTCOMES

- Explain how Q-Learning works and how it learns off policy
- Use Q-Learning to compute value functions  
- Perform sensitivity analysis on a Q-Learning algorithm

### CONCEPTS

- Q-Learning to act off policy
- The Q-Learning algorithm

---  

### I. Q-table

We recall the big picture of what we're trying to do:  
Given state space $S$ and action space $A$, learn a value $Q(s,a)$ for each state-action pair.  
These values are organized in an array called the *Q-table*.

Q-Learning is a method for building this table.

We initialize the table (for example, with all zeros, or with random values and zeros for the terminal state) and then use TD(0) updates for training.

<img src="./Q-Learning_Matrix_Initialized_and_After_Training.png">
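As a minimal sketch of the initialization options, assuming NumPy and a hypothetical problem with 4 states and 5 actions where the last state is terminal (these sizes are illustrative):

```python
import numpy as np

num_states, num_actions = 4, 5   # hypothetical sizes, for illustration only
terminal_state = 3               # assumed: the last state is terminal

# Option 1: all zeros
Q_zeros = np.zeros((num_states, num_actions))

# Option 2: random values, with zeros in the terminal row
rng = np.random.default_rng(seed=0)
Q_random = rng.normal(size=(num_states, num_actions))
Q_random[terminal_state, :] = 0.0  # no action is taken from a terminal state

print(Q_zeros.shape)
print(Q_random[terminal_state, :])
```

Either way, the terminal row stays at zero because no action is ever taken from a terminal state.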

### II. Q-Learning

Q-Learning is an **off-policy TD control algorithm** that was an early breakthrough in RL.

Quick reminder of what off-policy means:

We want action-value estimates for the optimal policy, but improving those estimates requires exploring, which means sometimes taking non-optimal actions. These two goals are at odds.

Consider: You're looking for a faster route to work. If you try different routes, some will be slower.  
These slower routes shouldn't factor into the timing of the optimal route. You separate optimal route timing from exploration.

We do this by maintaining two policies:
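- behavior policy for exploring (it generates the actions the agent actually takes)
- target policy for learning optimality (it is the policy being evaluated and improved)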
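A minimal sketch of the two policies, assuming both are derived from the same row of Q-values: the target policy is greedy, while the behavior policy is epsilon-greedy. The function names `target_policy` and `behavior_policy` and the example values are illustrative, not taken from the sources.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def target_policy(action_values):
    """Greedy: always pick the action with the highest estimated value."""
    return int(np.argmax(action_values))

def behavior_policy(action_values, epsilon):
    """Epsilon-greedy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(action_values)))  # random exploratory action
    return target_policy(action_values)

q_row = np.array([1.0, 3.0, 2.0])  # hypothetical action values for one state
greedy_action = target_policy(q_row)
actions_taken = [behavior_policy(q_row, epsilon=0.5) for _ in range(1000)]

print("target policy picks:", greedy_action)
print("behavior policy picked:", sorted(set(actions_taken)))
```

The target policy always returns the same action for a given row of values, while the behavior policy occasionally returns other actions so that their values can also be estimated.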

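Now we show the update equation for the action-value estimate.  
It is very similar to the TD(0) update step for the state value.

Since we will use sample data, $Q$ will denote the estimate. In Q-Learning, $Q$ directly approximates the optimal action-value function $q_*$, regardless of the policy used to collect the data.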

$Q(s,a) := Q(s,a) + \alpha [r + \gamma \underset{a}{\operatorname{\max}} Q(s',a) -  Q(s,a)]$

Explaining the different components:

<img src="./q_learning_update.png">

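An important difference is the $\underset{a}{\operatorname{\max}} Q(s',a)$ term, where you might have expected $Q(s',a')$ for the action $a'$ that is actually taken next.

The agent finds the most valuable action in the next state $s'$ and uses its value in the update.

However, the agent may not actually take this action once it reaches $S_{t+1}=s'$: the behavior policy may select a different $A_{t+1}$ in order to explore.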

This is what it means to act off policy: the target policy is separated from the behavior policy.
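To make the update concrete, here is a small worked sketch with made-up numbers (not from the sources). It computes the Q-Learning target, which bootstraps from the best next action, alongside the target that would result from using the action the behavior policy actually takes next.

```python
alpha, gamma = 0.1, 0.99   # step size and discount factor (illustrative values)

q_sa = 2.0                 # current estimate Q(s, a)
reward = 10.0              # observed reward r
q_next = [1.0, 5.0, 3.0]   # hypothetical Q(s', .) for three actions
next_action_taken = 2      # assumed: exploratory action chosen in s'

# Q-Learning (off-policy) target: bootstrap from the best action in s'
q_learning_target = reward + gamma * max(q_next)
# On-policy style target: bootstrap from the action actually taken in s'
on_policy_target = reward + gamma * q_next[next_action_taken]

# Move Q(s, a) a fraction alpha of the way toward the Q-Learning target
q_sa_updated = q_sa + alpha * (q_learning_target - q_sa)

print(q_learning_target)
print(on_policy_target)
print(q_sa_updated)
```

With these numbers the Q-Learning target is $10 + 0.99 \cdot 5 = 14.95$, so the updated estimate is $2 + 0.1\,(14.95 - 2) = 3.295$. The exploratory action taken in $s'$ does not affect the update.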

---

**Septic Shock**

Next, let's look at a computational example. The objective is to reduce the chance of septic shock, measured by the proxy SOFA score, by using a drug called a vasopressor. The values are for illustration only. Following the code is a series of exercises that we will work through.

Background:  
- **Septic shock**: a life-threatening condition that happens when blood pressure drops to a dangerously low level after an infection.
- **Sequential Organ Failure Assessment (SOFA) score**: a scoring system that assesses the performance of several organ systems in the body. We will use this to measure state.
- **Vasopressor (vaso)**: a drug that healthcare providers use to make blood vessels constrict in people with low blood pressure.

import numpy as np
import random
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})

# Initialize states, actions, Q function

# states
sofa_levels = [0,1,2,3]
num_states = len(sofa_levels)
terminal_state = 3

# actions
vaso_doses = [0,1,2,3,4] # list of dose levels; named differently from the per-step action chosen in the loop
num_actions = len(vaso_doses)

# initialize array to store action values Q
#Q = np.random.normal(size=(num_states, num_actions))
Q = np.zeros(shape=(num_states, num_actions))
#Q[terminal_state,:] = 0 # no action taken from terminal state, so no value


def act(epsilon, action_values):
    '''
    epsilon-greedy policy: return action using epsilon-greedy strategy
    '''
    action_size = len(action_values)
    if np.random.rand() <= epsilon: # random draw with prob epsilon
        return random.randrange(action_size)
    return np.argmax(action_values)  # returns action

def calc_reward(state):
    '''
    simple reward function for illustration. lower state value is better
    '''
    if state == 3:
        reward = -100
    elif state == 2:
        reward = -10
    elif state == 1:
        reward = 0
    else:
        reward = 10
    return reward

def determine_next_state(state, action):
    '''
    return next state from the environment
    to be replaced with simulated data or alternative    
    '''
    if state in [0,1,2] and action == 0: # no dose raises state
        next_state = min(terminal_state, state + 1)
    elif action in [3,4]: # higher doses lower state (floored at zero)
        next_state = max(0, state - 1)
    else:
        next_state = random.choice([1,2])
    return next_state

# Run the Process
num_episodes = 5000
max_timesteps = 100
epsilon = 0.1
alpha = 0.1 # weight on new data 
gamma = 0.99 # discount factor

for ep in range(num_episodes):
    if ep % 10 == 0:
        print('episode',ep+1)
    #print('(state,action,reward,next_state) transitions')
    sofa_level = 0 # initialize state
    done = False
    for tm in range(max_timesteps):
        
        # given state, get action from policy
        vaso_dose = act(epsilon, Q[sofa_level,:])
        
        next_sofa = determine_next_state(sofa_level, vaso_dose)
        reward = calc_reward(next_sofa)
        transition = (sofa_level,vaso_dose,reward,next_sofa)
        #print(transition)
        
        # update Q(S,A) using TD(0)
        # Q(S,A) = Q(S,A) + alpha (r + gamma * max_a Q(S',a) - Q(S,A))
        Q[sofa_level,vaso_dose] += alpha*(reward+gamma*np.amax(Q[next_sofa,:])-Q[sofa_level,vaso_dose])        
                
        sofa_level = next_sofa # update sofa for next iteration
        
        # terminal state check
        if next_sofa == terminal_state:
            done = True
            break
    if ep % 10 == 0:
        print('Q \n', Q)
    

**Exercise 1**

If the agent is in state 0, what is the most valuable action? What is the least valuable action? Enter your final Q estimates here.

**Exercise 2**

How do your answers change with different values of $\alpha$? With different values of $\epsilon$? Enter your final Q estimates here.

**Exercise 3**

We initialized Q with zeros. How do your answers in (1) change if you initialize Q with standard normal deviates instead (keeping zeros for the terminal state)?  
Enter your final Q estimates here.

**Exercise 4**

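Does Q seem to converge? In theory, Q-Learning converges to the optimal action values given enough iterations, provided every state-action pair keeps being visited and the step size $\alpha$ is decreased appropriately over time.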
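One way to examine convergence empirically is to track the largest absolute change in any entry of Q between successive snapshots. The sketch below uses hypothetical snapshots rather than output from the training loop, and the helper name `max_abs_change` is illustrative.

```python
import numpy as np

def max_abs_change(q_old, q_new):
    """Largest absolute difference between two Q-tables of the same shape."""
    return float(np.max(np.abs(q_new - q_old)))

# Hypothetical snapshots of a 2x2 Q-table taken at different points in training
q_snapshots = [
    np.zeros((2, 2)),
    np.array([[1.0, 0.0], [0.0, 2.0]]),
    np.array([[1.5, 0.0], [0.0, 2.1]]),
    np.array([[1.55, 0.0], [0.0, 2.1]]),
]

# Compare each snapshot with the one that follows it
changes = [max_abs_change(a, b) for a, b in zip(q_snapshots, q_snapshots[1:])]
print(changes)
```

A sequence of changes that shrinks toward zero suggests the estimates are settling. With a constant step size and random transitions, the changes may shrink without reaching exactly zero.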

**Exercise 5**

Modify the code to return all transitions as a list of tuples. Paste the first 10 transitions below.

---

### III. Limitations of Q-Learning

As we've learned, Q-Learning involves storing and updating a table or array of values $Q(S,A)$ where each element represents the value of a *(state,action)* tuple. This is called a *Q-table*.

As the number of states and actions (the *state-action space*) grows, this approach becomes unmanageable in terms of both storage and computation. This occurs for continuous variables or discrete variables with a massive number of possible values.

There are two approaches to handle this issue:

- Quantize the values. For example, medication doses might be bucketed into dose ranges.
- Use function approximators for Q. Function approximation is now very popular, with neural nets playing a major role.
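The quantization approach can be sketched with NumPy's `np.digitize`, which maps each continuous value to the index of the bucket it falls into. The dose values and bucket edges below are made up for illustration and carry no clinical meaning.

```python
import numpy as np

# Hypothetical continuous doses and bucket edges (values are made up)
doses = np.array([0.0, 0.04, 0.15, 0.32, 0.9])
bin_edges = np.array([0.05, 0.2, 0.5])   # 3 edges -> 4 buckets: 0, 1, 2, 3

# For each dose, find the index of the bucket it falls into
dose_buckets = np.digitize(doses, bin_edges)
num_actions = len(bin_edges) + 1          # one discrete action per bucket

print(dose_buckets)
print(num_actions)
```

The bucket indices can then serve as the discrete actions that index the columns of a Q-table. Coarser buckets give a smaller table at the cost of less precise dosing decisions.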

**Going Deep**

When deep neural networks are used with Q-Learning, the model is called a *Deep Q-Network*. We will study these next.

In general, pairing reinforcement learning with a deep neural network is called *Deep Reinforcement Learning*, abbreviated Deep RL.

---