# Title

There is one recurring problem regarding policies in Reinforcement Learning: how do you optimize
a policy when you need to act non-optimally to explore all actions and find the optimal one?

This problem leads to two core approaches regarding policy optimization:


## On-policy vs. off-policy
### On-policy
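The on-policy approach acts as a compromise: the agent evaluates and improves the same policy it uses to select actions, so it follows an almost optimal policy that leaves room for exploration (e.g. an $\epsilon$-greedy policy).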
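
One common way to follow an almost optimal policy while still exploring is $\epsilon$-greedy action selection. The sketch below is only an illustration: the helper name `epsilon_greedy` and its parameters are assumptions and do not come from any library.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index for one state.

    q_values: estimated action values for the current state (hypothetical input).
    epsilon:  probability of taking a uniformly random action.
    """
    if rng.random() < epsilon:
        # Explore: ignore the estimates and pick any action
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimate (first one on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With epsilon = 0 the choice is purely greedy
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))
```

A small `epsilon` keeps the agent close to the greedy policy, while a larger value spends more steps on exploration.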

### Off-policy
A different idea is to use a separate behavior policy (e.g. an $\epsilon$-greedy or purely random policy) to generate the data, while the target policy (e.g. a greedy policy) is the one being optimized. This way, the agent may choose actions that are not part of the target policy, thus exploring a wider range of actions whose outcomes are then used to further optimize the target policy.

## Temporal-Difference Learning
To better understand the two concepts, it is necessary to introduce Temporal-Difference (TD) Learning. TD Learning is a model-free method that learns from incomplete episodes: an episode does not have to terminate before it can be used to update the agent's estimates. This is particularly important because some episodes take a long time to terminate, so learning from them earlier can save a lot of time. More specifically, the simplest TD method, TD(0), only needs to wait one additional time step; it then updates the value estimate $V(S_t)$ using the observed reward $R_{t+1}$ and the current estimate $V(S_{t+1})$.
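
The one-step update can be written as $V(S_t)\leftarrow V(S_t)+\alpha\left[R_{t+1}+\gamma V(S_{t+1})-V(S_t)\right]$. The following is a minimal sketch of that rule, assuming a tabular value function stored in a dictionary; the function name `td0_update` and the two example states are hypothetical.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.6):
    """Apply one TD(0) update to the value table V in place."""
    td_target = reward + gamma * V[next_state]  # one-step bootstrapped target
    td_error = td_target - V[state]             # how far the estimate is off
    V[state] += alpha * td_error                # move a fraction alpha toward the target
    return td_error


# Hypothetical two-state example: one transition A -> B with reward 1
V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", reward=1.0, next_state="B")
print(round(V["A"], 2))  # target is 1 + 0.6 * 1 = 1.6, so V["A"] moves to about 0.16
```

Note that the update happens right after a single transition; no complete episode is required.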


### TD Learning on-policy example: SARSA
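
SARSA takes its name from the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ used in its update. It is on-policy because the target uses the action $A_{t+1}$ that the agent actually selects next:
\begin{align}
Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha \left[R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)\right]
\end{align}

A minimal sketch of this update, assuming a Q-table stored as a nested list; the function name `sarsa_update` and the example values are hypothetical.

```python
def sarsa_update(q_table, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.6):
    """Apply one SARSA update to q_table[state][action] in place."""
    # The target uses the value of the action actually chosen in the next state
    td_target = reward + gamma * q_table[next_state][next_action]
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


# Hypothetical table with 2 states and 2 actions
q = [[0.0, 0.0], [1.0, 5.0]]
# The agent will take action 0 in state 1, so the target is -1 + 0.6 * 1 = -0.4
sarsa_update(q, state=0, action=0, reward=-1.0, next_state=1, next_action=0)
print(round(q[0][0], 2))  # about -0.04
```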

### TD Learning off-policy example: Q-learning 
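In Q-learning, the agent learns the optimal policy by always assuming the greedy action in the next state when computing its update target, regardless of which action the behavior policy actually takes there. It updates its Q-values according to this equation: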
\begin{align}
Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha \left[R_{t+1}+\gamma \max_{{a}} Q(S_{t+1},a)-Q(S_t,A_t)\right]
\end{align}
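
To make the update concrete, here is a small worked sketch, assuming a Q-table stored as a nested list; the function name `q_learning_update` and the example values are hypothetical.

```python
def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.6):
    """Apply one Q-learning update to q_table[state][action] in place."""
    # The target uses the best action value in the next state,
    # no matter which action the agent goes on to take there
    best_next = max(q_table[next_state])
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


# Hypothetical table with 2 states and 2 actions
q = [[0.0, 0.0], [1.0, 5.0]]
# Best value in state 1 is 5, so the target is -1 + 0.6 * 5 = 2.0
q_learning_update(q, state=0, action=0, reward=-1.0, next_state=1)
print(round(q[0][0], 2))  # about 0.2
```

Because of the $\max$ in the target, the learned values describe the greedy policy even when the data comes from an exploratory one.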


To illustrate the efficiency of Q-learning, the following example compares an agent that acts randomly with one trained using Q-learning. The problem consists of a taxi that starts at a random location and has to pick up a customer and drop them off at a target location. The pick-up and drop-off locations are chosen randomly from four possible locations.

In [2]:
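import gym  # assumes the classic Gym API (gym < 0.26): 4-tuple step(), render(mode=...)

# .env strips the outer wrapper (the step limit in classic Gym),
# so the episode only ends once the passenger has been dropped off
env = gym.make("Taxi-v3").env
env.reset()

epochs = 0
penalties, reward = 0, 0

frames = []  # for animation

done = False

while not done:
    # Baseline agent: sample a uniformly random action at every step
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    # A reward of -10 marks an illegal pick-up or drop-off
    if reward == -10:
        penalties += 1

    # Store each rendered frame in a dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })

    epochs += 1


def print_frames(frames):
    """Print the final frame together with its step count, state and action."""
    frame = frames[-1]
    print(frame['frame'])
    print(f"Timestep: {len(frames)}")
    print(f"State: {frame['state']}")
    print(f"Action: {frame['action']}")


print_frames(frames)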


+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 502
State: 0
Action: 5


In [1]:
import gym  # assumes the classic Gym API (gym < 0.26): 4-tuple step(), render(mode=...)
import random
import numpy as np

env = gym.make("Taxi-v3")
env.reset()

# Q-table with one row per state and one column per action
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration probability

# For plotting metrics
all_epochs = []
all_penalties = []
frames = []


def print_frames(frames):
    """Print the final frame together with its step count, state, action and reward."""
    frame = frames[-1]
    print(frame['frame'])
    print(f"Timestep: {len(frames)}")
    print(f"State: {frame['state']}")
    print(f"Action: {frame['action']}")
    print(f"Reward: {frame['reward']}")


# Training: epsilon-greedy behavior policy, greedy update target
for i in range(1, 10000):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        # Algebraically the same as Q + alpha * (R + gamma * max Q' - Q)
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    # Optional progress report (needs `from IPython.display import clear_output`):
    # if i % 100 == 0:
    #     clear_output(wait=True)
    #     print(f"Episode: {i}")

# Evaluation: run one episode with the learned greedy policy
done = False
state = env.reset()

while not done:
    action = np.argmax(q_table[state])
    state, reward, done, info = env.step(action)

    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })

print_frames(frames)

print("Training finished.\n")



+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 14
State: 85
Action: 5
Reward: 20
Training finished.



The difference in timesteps between the two runs shows how much time a TD-based algorithm can save: the agent acting randomly needed 502 timesteps to finish the task, while the agent trained with Q-learning needed only 14.

## References

- [Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction](https://books.google.de/books?hl=de&lr=&id=uWV0DwAAQBAJ&oi=fnd&pg=PR7&dq=richard+sutton+andrew+barto&ots=miqNm2-_i9&sig=Xv2GFGQyFAemej2n6HMvDU01oiE&redir_esc=y#v=onepage&q=richard%20sutton%20andrew%20barto&f=false)
- [Lilian Weng: A (Long) Peek into Reinforcement Learning](https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html)