## Q learning
Q-learning is an off policy algorithm for Temporal Difference (TD) control.
It starts from the SARSA update rule and modifies it so that the Q value of the next state is bootstrapped before an action is chosen and executed there.
Thus, the update rule becomes: $Q^\pi(s_t, a_t) \gets Q^\pi(s_t, a_t) + \alpha[r_t + \gamma \max_{a'} Q^\pi(s_{t+1}, a') - Q^\pi(s_t, a_t)]$.
The $\max$ operator introduces what is called a maximization bias, which is usually a drawback rather than an advantage.
In other words, the agent tends to believe that some actions are better than they really are, because the maximum is taken over noisy estimates rather than over the true Q values.
Nevertheless, Q-learning has proven to achieve great results in games.

### Characteristics of Q learning:

##### Model free
Q learning is model free, i.e. it does not require knowledge of the transition dynamics or the reward function, and learns from sampled transitions instead.

##### Finite and discrete state and action spaces
For tabular Q learning to work, the environment must have finite and discrete state and action spaces, because the state-action values are stored internally in a dictionary with one entry per state-action pair.

##### Off policy
 - On policy methods attempt to evaluate or improve the policy that is used to make decisions.
 - Off policy methods evaluate or improve a policy different from that used to generate the data.
 A major advantage of off policy methods is that they can learn the optimal policy **regardless** of the actions the agent actually takes and of its motivation.

Q learning in particular is an off policy algorithm.
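
To make the distinction concrete, the sketch below computes the SARSA target and the Q learning target from the same next-state Q values. The function names `sarsa_target` and `q_learning_target`, as well as the sample numbers, are illustrative assumptions and not part of the agent defined later in this notebook.

```python
def sarsa_target(reward, next_q_values, next_action, discount_factor):
    # on policy: bootstrap from the action the behaviour policy takes next
    return reward + discount_factor * next_q_values[next_action]


def q_learning_target(reward, next_q_values, discount_factor):
    # off policy: bootstrap from the greedy action, whatever is executed next
    return reward + discount_factor * max(next_q_values)


next_q_values = [0.0, 2.0, -1.0, 0.5]  # hypothetical Q values of the next state
exploratory_action = 2                 # hypothetical action picked while exploring

print("SARSA target:     ", round(sarsa_target(1.0, next_q_values, exploratory_action, 0.9), 2))
print("Q learning target:", round(q_learning_target(1.0, next_q_values, 0.9), 2))
```

When the next action is exploratory, the SARSA target inherits its low value, whereas the Q learning target does not depend on it.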

##### Q values: state-action pair values
Q learning evaluates and updates Q values, i.e. state-action pair values, instead of V values, i.e. state values.
This means that the expected future sum of rewards is not tied to a state, but to a state-action pair.

##### Convergence
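Tabular Q learning is guaranteed to converge to the *optimal* action-value function if every state-action pair is visited infinitely often and the learning rate is decayed appropriately (see the theorem on learning rate sequences below).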

##### Bootstrapping & Sample based
Q learning combines bootstrapping with sampling, just like SARSA.
 - The bootstrapping aspect is shared with dynamic programming methods.
 The update formula is a Bellman backup over just one transition. It is executed on every transition, and the value of the current state-action pair is bootstrapped from the old estimate of the best action in the sampled next state.
 We are able to back up over just one transition because we leverage the **Markov assumption** of the domain.
 - The sampling aspect is shared with Monte Carlo methods and is what allows for a model free algorithm.

##### Decoupling the Q function estimator from the actual policy
Since Q learning is off policy, the policy used to estimate Q values (the target policy) is different from the policy used to act (the behaviour policy).
The estimator is essentially a maximization operator over the possible actions of the next state.

This allows us to bootstrap the Q value of the next state **without** actually executing an action in the next state.

##### Maximization bias
We introduced a maximization operator over the possible actions of the next state as the estimator of its value.
A downside is that actions whose estimates happen to be unusually large at time $t$ are picked by the $\max$ more often, which exaggerates the true value of the next state.

In the end, our state value estimate is, in expectation, **at least as large** as the true value of state $s$, so we systematically overestimate the value of the state when the estimates are based on finite samples.
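
The overestimation can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not part of the agent: every action is assumed to have a true value of 0 and Gaussian reward noise, so the true maximum is 0 as well. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4
n_samples_per_action = 5
n_trials = 10_000

# assumption: every action has a true value of 0, so the true max is also 0
noisy_rewards = rng.normal(loc=0.0, scale=1.0,
                           size=(n_trials, n_actions, n_samples_per_action))

# sample-mean estimate of each action's value in each trial
q_estimates = noisy_rewards.mean(axis=2)

# averaging the max over the noisy estimates exposes the upward bias
average_max_estimate = q_estimates.max(axis=1).mean()

print("true max value: 0.0")
print("average of max over estimates:", round(float(average_max_estimate), 3))
```

Each individual estimate is unbiased, yet the average of their maximum is clearly positive. Increasing `n_samples_per_action` shrinks the gap, which is why the bias matters most with few samples.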

##### Finite & Infinite Horizon
Q learning can be used in both finite and infinite horizon settings, i.e. it works with both episodic and non-episodic domains.
Infinite horizon settings are possible because the Q learning update of the Q value function happens at every step, not at the end of an episode as in Monte Carlo.

##### Lower variance and lower data efficiency than Monte Carlo
Q learning has lower variance and lower data efficiency when compared to Monte Carlo methods.
This is a property inherited from Temporal Difference learning.
For more information, have a look at [Temporal Difference learning](../4-temporal-difference/td_agent.ipynb).

##### Lower data efficiency when using Q values than using V values
Agents that work with state-action pair values usually have lower data efficiency than their counterparts working with state values while exploring.
 - more memory is used to represent state-action pair values than state values
 - more data is needed to train the agent, i.e. the agent needs to spend more time interacting with the environment to learn a good policy while exploring

##### Epsilon greedy policy
Epsilon greedy policies determine how often the agent explores and how often it exploits.

Furthermore, we want the epsilon greedy policy to be **greedy in the limit with infinite exploration (GLIE)**:
 - all state-action pairs are visited an infinite number of times
 - $\epsilon_{t} \to 0$ as $t \to \infty$, i.e. the policy becomes greedy in the limit

A decaying schedule for epsilon that fits these requirements is the following, applied at the end of each episode:
$ \epsilon \gets 1 / ( c_{\epsilon} \times f_{\epsilon})$, where $ c_{\epsilon} $ is a counter that is incremented after each episode ends, whereas $ f_{\epsilon} $ is a constant factor.
Note that the agent in this notebook keeps epsilon constant instead (see Initialization).
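
As a minimal sketch of such a schedule, assuming the counter starts at 1: the function name `decayed_epsilon`, its default `factor` and the clamp to 1 are hypothetical choices made for illustration.

```python
def decayed_epsilon(episode_count, factor=1.0):
    # epsilon <- 1 / (c_eps * f_eps), with c_eps = episode_count starting at 1;
    # the result is clamped to 1 because epsilon is a probability
    return min(1.0, 1.0 / (episode_count * factor))


for episode_count in (1, 10, 100, 1000):
    print(episode_count, decayed_epsilon(episode_count))
```

Epsilon shrinks towards 0 as the episode counter grows, so the policy becomes greedy in the limit.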

##### Discount factor
The discount factor must take a value in the range $[0, 1]$.
 - setting it to $1$ means that future rewards are valued as much as the immediate reward.
 - setting it to $0$ means that future rewards are not valued at all, only the immediate reward.
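
The effect of the two extremes can be seen on a short reward sequence. The helper `discounted_return` and the reward values are hypothetical and only serve as an illustration.

```python
def discounted_return(rewards, discount_factor):
    # G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., accumulated from the last reward
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount_factor * total
    return total


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards of three consecutive steps

for discount_factor in (0.0, 0.9, 1.0):
    print(discount_factor, round(discounted_return(rewards, discount_factor), 2))
```

With a discount factor of $0$ only the first reward counts, with $1$ all rewards count equally, and values in between weigh later rewards less and less.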

##### Learning rate
The learning rate *usually* takes a value in the range $[0, 1]$.
 - values closer to $1$ give a higher weight to newer data, which can help learning in non-stationary domains.
 - values closer to $0$ give a higher weight to older data.
 - a value of $0.5$ gives the same weight to the old estimate and the new target.
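
The weighting can be checked on a single incremental update of the form `estimate + learning_rate * (target - estimate)`. The helper name `blend` and the numbers are hypothetical.

```python
def blend(old_estimate, new_target, learning_rate):
    # move the old estimate towards the new target by a fraction learning_rate
    return old_estimate + learning_rate * (new_target - old_estimate)


old_estimate, new_target = 0.0, 10.0  # hypothetical values

for learning_rate in (0.0, 0.4, 0.5, 1.0):
    print(learning_rate, blend(old_estimate, new_target, learning_rate))
```

A learning rate of $0$ keeps the old estimate unchanged, a learning rate of $1$ replaces it with the new target, and $0.5$ lands exactly halfway between the two.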

##### Theorem: Robbins-Monro sequence for Learning rate
Because Q learning is off policy, it can learn an optimal policy independently of the actions the agent actually takes.
So Q learning **empirically** figures out the optimal policy regardless of the agent’s motivation.

Nevertheless, mathematically speaking there is **no** guaranteed convergence to such an optimal policy without further conditions.
This means that even though Q learning can be more robust than SARSA in finding optimal policies empirically, bad cases can happen, even more so when it suffers from maximization bias.

To **guarantee** convergence we again introduce the following conditions.

For finite-state and finite-action MDPs, Q learning converges to the optimal action-value function, i.e. $Q(s, a) \to q_*(s, a)$, if the following two conditions hold:
 1. The sequence of policies $\pi$ is GLIE
 2. The step-sizes $\alpha_t$ form a Robbins-Monro sequence such that:
  - $ \sum^{\infty}_{t=1} \alpha_t = \infty $
  - $ \sum^{\infty}_{t=1} \alpha^2_t < \infty $

That is why one would normally use a **decaying learning rate** that satisfies the above conditions, like we did in Incremental Monte Carlo.
A learning rate similar to the one we used in Incremental Monte Carlo, of the form $ k \times 1/c_{\epsilon}$, satisfies both conditions.
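
The two conditions can be checked numerically for a learning rate of the form $k / t$. The sketch below is only an illustration: the helper `partial_sums` and the value of `k` are hypothetical, and finite partial sums can suggest, but not prove, the behaviour in the limit.

```python
import math


def partial_sums(k, n_steps):
    # alpha_t = k / t for t = 1..n_steps
    step_sizes = [k / t for t in range(1, n_steps + 1)]
    return sum(step_sizes), sum(a * a for a in step_sizes)


k = 0.4  # hypothetical constant factor
square_sum_limit = k ** 2 * math.pi ** 2 / 6  # limit of the sum of squares

for n_steps in (10, 1_000, 100_000):
    total, total_of_squares = partial_sums(k, n_steps)
    print(n_steps, round(total, 3), round(total_of_squares, 5))

print("limit of the sum of squares:", round(square_sum_limit, 5))
```

The sum of the step sizes keeps growing with the number of steps, while the sum of their squares stays below a fixed limit. A constant learning rate fails the second condition, because the sum of its squares grows without bound.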

##### Initialization
For Q learning we keep track of the following:
 - Q value functions, initially set to $0$
 - In order to showcase how robust off policy algorithms like Q learning are, we are going to keep the epsilon and learning rate constant.
   - `self.learning_rate` is set to $0.4$.
   - `self.epsilon` is set to $0.1$.
 - `self.discount_factor` is set to $0.9$.

In [None]:
import numpy as np
import random
from environment import Env
from collections import defaultdict

class QLearningAgent:
    def __init__(self, actions):
        # actions = [0, 1, 2, 3]
        self.actions = actions
        self.learning_rate = 0.4
        self.discount_factor = 0.9
        self.epsilon = 0.1
        # Q table: maps a state key to one Q value per action, all initialised to 0
        self.q_table = defaultdict(lambda: [0.0] * len(self.actions))

### Q learning

Q learning is an off policy algorithm that combines sampling with bootstrapping.

The update rule for Q values in Q learning is the following:

$Q^{\pi}(s_t, a_t) \gets Q^\pi(s_t, a_t) + \alpha[r_t + \gamma \max_{a'} Q^\pi(s_{t+1}, a') - Q^\pi(s_t, a_t)]$ where:
 - $Q^{\pi}(s_t, a_t)$ - Q value of current state-action pair following the policy $\pi$
 - $\max_{a'} Q^{\pi}(s_{t+1}, a')$ - the current estimate of the state value of the next state, taken as the largest Q value over its actions.
 - $\alpha$ - the **learning rate**.
 The learning rate can take any value in the range $[0, 1]$.
 Values closer to 0 mean that we put more weight on older experiences, whereas values closer to 1 mean that we put more weight on the latest experiences.
 In our case, the learning rate takes the value $0.4$.
 - $r_t$ - the reward at time-step $t$.
 - $\gamma$ - the **discount factor**.
 Traditionally used when calculating returns, now it is used when calculating **expectancies of returns**, i.e. state values.

The difference $r_t + \gamma \max_{a'} Q^{\pi}(s_{t+1}, a') - Q^{\pi}(s_t, a_t)$ is commonly referred to as the **TD error**.

The sum $r_t + \gamma \max_{a'} Q^{\pi}(s_{t+1}, a')$ is referred to as the **TD target**.
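
As a worked example of one update step, the sketch below uses the learning rate and discount factor of this notebook. The current Q value, the reward and the next-state Q values are hypothetical numbers chosen for illustration.

```python
learning_rate = 0.4
discount_factor = 0.9

current_q = 0.5                       # hypothetical Q(s_t, a_t)
reward = -1.0                         # hypothetical r_t
next_q_values = [0.0, 2.0, 1.0, 0.0]  # hypothetical Q(s_{t+1}, a') for each a'

# TD target: reward plus the discounted value of the best next action
td_target = reward + discount_factor * max(next_q_values)
# TD error: how far the current estimate is from the target
td_error = td_target - current_q
# move the current estimate towards the target by a fraction learning_rate
updated_q = current_q + learning_rate * td_error

print("TD target:", round(td_target, 2))
print("TD error: ", round(td_error, 2))
print("updated Q:", round(updated_q, 2))
```

Only the largest next-state Q value enters the target. The other three entries of `next_q_values` have no influence on the update.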

In [None]:
class QLearningAgent(QLearningAgent):
    # update the Q function with one sample <s, a, r, s'>
    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]
        # TD target from the Bellman optimality equation: bootstrap from the
        # best action in the next state, not the one that will be executed
        td_target = reward + self.discount_factor * max(self.q_table[next_state])
        td_error = td_target - current_q
        self.q_table[state][action] = current_q + self.learning_rate * td_error

### Other methods

##### Helper methods

In [None]:
class QLearningAgent(QLearningAgent):
    # choose an action for the state from the Q table,
    # following an epsilon-greedy policy
    def get_action(self, state):
        if np.random.rand() < self.epsilon:
            # take random action
            action = np.random.choice(self.actions)
        else:
            # take action according to the q function table
            state_action = self.q_table[state]
            action = self.arg_max(state_action)
        return action

In [None]:
class QLearningAgent(QLearningAgent):
    @staticmethod
    def arg_max(state_action):
        # return the index of the largest Q value, breaking ties at random
        max_index_list = []
        max_value = state_action[0]
        for index, value in enumerate(state_action):
            if value > max_value:
                max_index_list.clear()
                max_value = value
                max_index_list.append(index)
            elif value == max_value:
                max_index_list.append(index)
        return random.choice(max_index_list)

##### Main loop

In [None]:
class QLearningAgent(QLearningAgent):
    def mainloop(self, env, verbose=False):
        for episode in range(1000):
            state = env.reset()

            while True:
                env.render()

                # take action and proceed one step in the environment
                action = self.get_action(str(state))
                next_state, reward, done = env.step(action)

                # with sample <s,a,r,s'>, agent learns new q function
                self.learn(str(state), action, reward, str(next_state))

                state = next_state
                env.print_value_all(self.q_table)

                # if episode ends, then break
                if done:
                    if verbose:
                        print("episode: ", episode,
                              "\tepsilon: ", round(self.epsilon, 2),
                              "\tlearning rate: ", round(self.learning_rate, 2)
                              )
                    break

In [None]:
if __name__ == "__main__":
    env = Env()
    agent = QLearningAgent(actions=list(range(env.n_actions)))
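    try:
        agent.mainloop(env, verbose=False)
    except (KeyboardInterrupt, Exception) as exc:
        # e.g. the run was interrupted or the render window was closed;
        # report the reason instead of silently swallowing it
        print("training stopped:", repr(exc))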

### Results

Q learning does converge to an optimal policy within 60 episodes.

Q learning, being an off policy algorithm, shows very robust results *even when the learning rates do not satisfy the Robbins-Monro sequence condition*.

As you can see from the image below, decoupling the Q function estimator from the policy allows us *not to bootstrap penalties from other states unnecessarily*.
Negative Q values remain only near the triangles, whereas the Q values of neighbouring states do not become unnecessarily negative.

<h3 style="text-align:center">After 100 episodes</h3>
<img src="ipynb_results/q_learning_100_episodes.png" alt="q_learning_100_episodes.png" width="50%" />