# Advanced RL Concepts (TD and On- vs Off-policy learning)

Today we will look at two RL algorithms which are very similar on the surface but differ in one key respect. Both algorithms learn the action-value function $Q(s, a)$ in order to find the best possible policy. Once $Q(s, a)$ has been learned, the policy $\pi(a | s)$ can be derived by greedily choosing, in each state $s$, the action $a$ with the highest value of $Q(s, a)$.
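As a minimal sketch of that last step, the snippet below derives a greedy policy from a small Q-table. The table `Q` and the helper names `greedy_action` and `greedy_policy` are made up for illustration and do not come from any library.

```python
# Hypothetical Q-table: 3 states x 2 actions, values chosen arbitrarily.
Q = [
    [0.1, 0.5],
    [0.7, 0.2],
    [0.0, 0.0],
]


def greedy_action(q_row):
    """Return the index of the largest entry; ties go to the lowest index."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def greedy_policy(q_table):
    """Pick the greedy action for every state (row) of the table."""
    return [greedy_action(row) for row in q_table]


policy = greedy_policy(Q)
print(policy)  # [1, 0, 0]
```

How ties are broken is a design choice; here the first maximal action wins, which matches the behaviour of `np.argmax`.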


## Temporal-Difference Learning

Both algorithms we look at today use **Temporal-Difference (TD)** Learning. TD-Learning means that either the action-value function $Q(s, a)$ or the value function $V(s)$ is improved by shifting the existing function according to the observed experiences. In this, TD-Learning shares some similarities with stochastic gradient descent (SGD) as employed in neural networks, since in each iteration SGD slightly shifts the parameters to gradually improve the neural network's performance.

Each TD-Learning step considers a sequence $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}$ from a trajectory $\tau$ at a timestep $t$. Furthermore, a learning-rate hyperparameter $\alpha \in (0, 1)$ is needed, and the discount factor $\gamma$ for future rewards has to be taken into consideration. TD-Learning shifts the existing function towards the **TD target** given by $r_{t+1} + \gamma V(s_{t+1})$ or $r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$ respectively, which estimates the function based on the observed trajectory:

\begin{align*}
    \text{new function} &= (1 - \alpha) \cdot \text{current function} + \alpha \cdot \text{observed function} \\
    \text{new value} &= (1 - \alpha) \cdot \text{old value} + \alpha \cdot \text{TD target}
\end{align*}

\begin{align*}
    V(s_t) &= (1 - \alpha) V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1})) \\
    V(s_t) &= V(s_t) - \alpha V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1})) \\
    V(s_t) &= V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1}) - V(s_t)) \\
\end{align*}

\begin{align*}
    Q(s_t, a_t) &= (1 - \alpha) Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})) \\
    Q(s_t, a_t) &= Q(s_t, a_t) - \alpha Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})) \\
    Q(s_t, a_t) &= Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)) \\
\end{align*}

This process of updating a function by considering the current estimate of that function is also called bootstrapping. It should also be noted that TD-Learning can learn from past experiences and even from incomplete trajectories, since only a sequence $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}$ needs to be considered. This may be important for RL problems which have no terminal state.
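To make the update rule concrete, here is a small sketch of a single TD update for $V(s)$, written once in the incremental form and once in the weighted-average form. The state names and all numbers are made up for illustration; the two helper functions are hypothetical and only mirror the equations above.

```python
def td_update(old_value, reward, next_value, alpha, gamma):
    """Incremental form: V(s_t) + alpha * (TD target - V(s_t))."""
    td_target = reward + gamma * next_value
    return old_value + alpha * (td_target - old_value)


def td_update_blend(old_value, reward, next_value, alpha, gamma):
    """Weighted-average form: (1 - alpha) * V(s_t) + alpha * TD target."""
    td_target = reward + gamma * next_value
    return (1 - alpha) * old_value + alpha * td_target


# Made-up value estimates for two hypothetical states.
V = {"s0": 2.0, "s1": 1.0}

incremental = td_update(V["s0"], reward=-1.0, next_value=V["s1"], alpha=0.1, gamma=0.6)
blended = td_update_blend(V["s0"], reward=-1.0, next_value=V["s1"], alpha=0.1, gamma=0.6)

# TD target = -1 + 0.6 * 1 = -0.4, so V(s0) moves 10% of the way from 2.0 to -0.4.
print(round(incremental, 6), round(blended, 6))  # 1.76 1.76
```

Both forms give the same result up to floating-point rounding, which is exactly what the derivation states.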


## $\epsilon$-Greedy

RL algorithms which learn from generated trajectories face a dilemma. The generation of a trajectory $\tau$ requires a policy $\pi(a | s)$. On the one hand, this policy should try to maximize the expected return so that the new trajectory can be used to further refine and exploit the learned parameters. On the other hand, this policy needs to explore new options which have not been considered before, otherwise we might get stuck in a very suboptimal local extremum.

Thus, we should take a brief look at the **$\epsilon$-Greedy** policy. This policy uses a simple idea to balance exploration and exploitation when generating a trajectory from the currently learned model, controlled by a hyperparameter $\epsilon \in (0, 1)$. When choosing an action $a$ in a given state $s$, the $\epsilon$-Greedy policy will choose one of two options at random.

The first option, taken with probability $\epsilon$, is to choose a random action $a$ from all available actions. This option covers the exploration aspect and hopefully leads to the discovery of new strategies.

The second option, taken with probability $1 - \epsilon$, is to choose the action $a$ greedily based on the currently learned information. This option covers the exploitation aspect, refining existing strategies further.
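The two options can be sketched as a small function. This is an illustrative version using only the Python standard library; the name `epsilon_greedy`, the `rng` parameter and the example Q-values are assumptions made for this sketch.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Choose an action index from one row of a Q-table.

    With probability epsilon a uniformly random action is returned
    (exploration), otherwise the action with the highest Q-value
    (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


# Made-up Q-values for a single state with three actions; action 1 is greedy.
q_row = [0.0, 2.0, 1.0]
rng = random.Random(0)  # seeded generator so the run is repeatable

samples = [epsilon_greedy(q_row, epsilon=0.1, rng=rng) for _ in range(10_000)]
greedy_share = samples.count(1) / len(samples)
print(greedy_share)  # should land close to 1 - 0.1 + 0.1 / 3
```

Note that the random option can also pick the greedy action by chance, so with $|\mathcal{A}|$ actions the greedy action is chosen with probability $1 - \epsilon + \epsilon / |\mathcal{A}|$ overall.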

## Introduction to the code examples
To compare the different approaches, each method is applied to OpenAI Gym's Taxi environment. A taxi picks up a passenger at one of four possible locations and drops them off at the target destination. To achieve that goal, the taxi can move in each of the four cardinal directions, as well as pick up and drop off the passenger. The efficiency of each method is evaluated using the variable "Timesteps", which indicates the number of actions taken by the taxi driver to reach the goal.
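For a tabular method, every situation of the taxi has to map to one row of the Q-table. The sketch below shows how the 500 discrete Taxi states can be indexed, assuming the mixed-radix layout of Gym's Taxi implementation (5 rows, 5 columns, 5 passenger locations including "in the taxi", 4 destinations). The functions `encode` and `decode` are local re-implementations written for illustration, not calls into the library.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Map a Taxi configuration to a single integer in [0, 500)."""
    index = taxi_row                        # 0..4
    index = index * 5 + taxi_col            # 0..4
    index = index * 5 + passenger_location  # 0..3 are the pick-up spots, 4 means "in the taxi"
    index = index * 4 + destination         # 0..3
    return index


def decode(index):
    """Inverse of encode: recover the four components from the state index."""
    index, destination = divmod(index, 4)
    index, passenger_location = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_location, destination


state = encode(3, 1, 2, 0)
print(state)          # 328
print(decode(state))  # (3, 1, 2, 0)
```

With 6 actions (four moves, pick-up, drop-off), a Q-table for this environment has 500 rows and 6 columns.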

## Example: random action selection

In [21]:
import gym


env= gym.make("Taxi-v3").env
env.reset()



epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False


while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1

def print_frames(frames):
    frame = frames[-1]
    print(frame['frame'])
    print(f"Timestep: {len(frames)}")

 
print_frames(frames)
print("Training finished.\n")

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 1184
Training finished.



## SARSA

**SARSA** is the most basic TD-Learning algorithm for learning the action-value function $Q(s, a)$. The name reflects that each learning step is based on the sequence state-action-reward-state-action $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}$.

1. Generate a trajectory $\tau$ using the $\epsilon$-Greedy policy.
2. For each sequence $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}$, adjust $Q(s_t, a_t)$ according to:

\begin{align*}
    Q(s_t, a_t) &= Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)) \\
\end{align*}

3. If learning is insufficient and there is time left, go back to step 1. Otherwise, terminate the algorithm and output $Q(s, a)$ and, if needed, the greedy policy.
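Step 2 can be sketched for a single transition as follows. The function name `sarsa_update`, the tiny two-state table and all numbers are made up for illustration; the default values of `alpha` and `gamma` are only example settings.

```python
def sarsa_update(q_table, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.6):
    """Apply one SARSA update in place and return the TD error."""
    # The target uses the action that was actually selected in the next state.
    td_target = reward + gamma * q_table[next_state][next_action]
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return td_error


# Hypothetical table with 2 states and 2 actions.
Q = [[0.0, 0.0],
     [1.0, 3.0]]

# Assume the epsilon-greedy policy explored and picked action 0 in state 1,
# although action 1 has the higher Q-value there.
sarsa_update(Q, state=0, action=1, reward=-1.0, next_state=1, next_action=0)
print(round(Q[0][1], 6))  # -0.04
```

The TD target here is $-1 + 0.6 \cdot 1 = -0.4$, so the exploratory choice of $a_{t+1}$ directly lowers the updated value.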

## Example: SARSA learning

In [24]:
import gym
from IPython.display import clear_output
from time import sleep
import random
import numpy as np

env = gym.make('Taxi-v3').env

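# Q-table: one row per state, one column per action
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration probability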

all_epochs = []
all_penalties = []
frames = []

# Function to print out the game-state
def print_frames(frames):
    frame = frames[-1]
    print(frame['frame'])
    print(f"Timestep: {len(frames)}")

          
# Starting the SARSA learning
for i in range(10000):
    epochs, penalties, reward, = 0, 0, 0
    state = env.reset()
    # Choosing action
    if random.uniform(0, 1) < epsilon:  
        action = env.action_space.sample()
    else:
        action = np.argmax(q_table[state])
    done=False
  
    while not done:
          
        # Getting the next state
        next_state, reward, done, info = env.step(action)
        
        # Choosing the next action
        if random.uniform(0, 1) < epsilon:  
            next_action = env.action_space.sample()
        else:
            next_action = np.argmax(q_table[next_state])
  
         
        old_value = q_table[state, action]
        next_value = q_table[next_state, next_action]
        # Learning the Q-value (SARSA target uses the action actually chosen next)
        q_table[state, action] = old_value + alpha * (reward + gamma * next_value - old_value)
  
        state = next_state
        action = next_action
          
        # Updating the respective values
        if reward == -10:
            penalties += 1
        epochs+=1
    
    #Printing out episode counter
    
    #if i % 100 == 0:
       #clear_output(wait=True)
       #print(f"Episode: {i}")        
            
# Evaluating the performance
done = False

state = env.reset()

while not done:
    # Act greedily with respect to the learned Q-table
    action = np.argmax(q_table[state])
    state, reward, done, info = env.step(action)

    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })
print_frames(frames)

print("Training finished.\n")



+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 14
Training finished.



## Q-Learning

**Q-Learning** is very similar to SARSA except that it uses a different TD target. The new TD target is $r_{t+1} + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a)$. In effect, we no longer bootstrap from the value of the action $a_{t+1}$ actually taken in the observed trajectory but from the maximum estimated value obtainable in state $s_{t+1}$. Thus we also no longer require $a_{t+1}$, and a sequence now only contains $s_t, a_t, r_{t+1}, s_{t+1}$.

1. Generate a trajectory $\tau$ using the $\epsilon$-Greedy policy.
2. For each sequence $s_t, a_t, r_{t+1}, s_{t+1}$, adjust $Q(s_t, a_t)$ according to:

\begin{align*}
    Q(s_t, a_t) &= Q(s_t, a_t) + \alpha (r_{t+1} + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a) - Q(s_t, a_t)) \\
\end{align*}

3. If learning is insufficient and there is time left, go back to step 1. Otherwise, terminate the algorithm and output $Q(s, a)$ and, if needed, the greedy policy.
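For a single transition, step 2 might look like the sketch below. The function name `q_learning_update`, the two-state table and the numbers are invented for illustration; `alpha` and `gamma` defaults are only example settings.

```python
def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.6):
    """Apply one Q-Learning update in place and return the TD error."""
    # The target bootstraps from the best action in the next state,
    # regardless of which action the behaviour policy takes there.
    td_target = reward + gamma * max(q_table[next_state])
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return td_error


# Hypothetical table with 2 states and 2 actions.
Q = [[0.0, 0.0],
     [1.0, 3.0]]

q_learning_update(Q, state=0, action=1, reward=-1.0, next_state=1)
print(round(Q[0][1], 6))  # 0.08
```

The TD target is $-1 + 0.6 \cdot 3 = 0.8$: only the largest entry of the next state's row matters, so no $a_{t+1}$ has to be passed in.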


## Example: Q-learning

In [19]:
import gym
import random
import numpy as np


env= gym.make("Taxi-v3")
env.reset()

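# Q-table: one row per state, one column per action
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration probability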

# For plotting metrics
all_epochs = []
all_penalties = []
frames = []



# Function to print out the game-state
def print_frames(frames):
    frame = frames[-1]
    print(frame['frame'])
    print(f"Timestep: {len(frames)}")


# Starting the Q-learning        
for i in range(1, 10000):
    state = env.reset()

    epochs, penalties, reward, = 0, 0, 0
    done = False
    
    while not done:
        # Choosing action
        if random.uniform(0, 1) < epsilon:  
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])
        
        # Getting next state
        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        # learning the Q-value
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

# Evaluating the performance

done = False

state = env.reset()

while not done:
    # Act greedily with respect to the learned Q-table
    action = np.argmax(q_table[state])
    state, reward, done, info = env.step(action)

    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })
print_frames(frames)
        
print("Training finished.\n")



+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 11
Training finished.



## Conclusion of examples
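With either of the two reinforcement learning algorithms, the taxi reached the goal in far fewer timesteps than with random action selection: 14 timesteps for SARSA and 11 for Q-Learning in the runs shown above, compared to 1184 for the random baseline. Since each run starts from a randomly drawn initial state, these single-episode numbers are only a rough indication. SARSA and Q-Learning performed similarly throughout the test period.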

## On- vs Off-policy

The change in the TD target may seem small, yet it is very important. SARSA estimates the Q-value assuming that the $\epsilon$-Greedy policy used to generate the data $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}$ continues to be followed. Thus, SARSA optimizes the $\epsilon$-Greedy policy and not the greedy policy. We call this **on-policy** since the policies used for data generation and for the updates are **the same**.

In contrast, Q-Learning also generates data using the $\epsilon$-Greedy policy, yet it updates based on the greedy policy. Through this, Q-Learning always tries to improve the greedy policy. This behaviour is called **off-policy** since the policies used for data generation and for the updates are **not** the same.
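One way to see the difference numerically is to compare the Q-Learning target with the average SARSA target when $a_{t+1}$ is drawn from the $\epsilon$-Greedy policy. The sketch below assumes the random option is uniform over all actions; the function names and the Q-values are made up for illustration.

```python
def greedy_target(reward, next_q_row, gamma):
    """Q-Learning target: bootstrap from the best next action."""
    return reward + gamma * max(next_q_row)


def average_epsilon_greedy_target(reward, next_q_row, gamma, epsilon):
    """Average of the SARSA target when a_{t+1} is chosen epsilon-greedily.

    With probability 1 - epsilon the greedy action is used, with
    probability epsilon a uniformly random one.
    """
    best = max(next_q_row)
    mean = sum(next_q_row) / len(next_q_row)
    return reward + gamma * ((1 - epsilon) * best + epsilon * mean)


# Made-up Q-values of the next state s_{t+1}.
next_q = [1.0, 3.0]

off_policy = greedy_target(-1.0, next_q, gamma=0.6)
on_policy = average_epsilon_greedy_target(-1.0, next_q, gamma=0.6, epsilon=0.1)
print(round(off_policy, 6))  # 0.8
print(round(on_policy, 6))   # 0.74
```

The on-policy target is lower because it accounts for the occasional exploratory action. For $\epsilon = 0$ both targets coincide.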

## References

- [Lilian Weng: A (Long) Peek into Reinforcement Learning](https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html)

- [OpenAI: Spinning Up](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#optional-formalism)

- [Reinforcement Learning: An Introduction - Richard S. Sutton and Andrew G. Barto](https://books.google.de/books?hl=de&lr=&id=uWV0DwAAQBAJ&oi=fnd&pg=PR7&dq=richard+sutton+andrew+barto&ots=miqNm2-_i9&sig=Xv2GFGQyFAemej2n6HMvDU01oiE&redir_esc=y#v=onepage&q=richard%20sutton%20andrew%20barto&f=false)

- [Q-Learning and SARSA, with Python](https://towardsdatascience.com/q-learning-and-sasar-with-python-3775f86bd178)

- [Reinforcement Q-Learning from Scratch in Python with OpenAI Gym](https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/)