## Double Q learning
Double Q learning is an improvement over the standard Q learning algorithm that addresses the maximization bias from which Q learning suffers.
In double Q learning, we maintain two independent unbiased estimates, $Q_1$ and $Q_2$. When we use one to select the maximizing action, we use the other to estimate the value of that action.
On each step, we update $Q_1$ with probability 0.5 and $Q_2$ otherwise.

The update rules now become the following two formulas:
 - $Q_1(s_t, a_t) \gets Q_1(s_t, a_t) + \alpha(r_t + \gamma Q_2(s_{t+1}, \arg\max_{a'} Q_1(s_{t+1}, a')) - Q_1(s_t, a_t))$
 - $Q_2(s_t, a_t) \gets Q_2(s_t, a_t) + \alpha(r_t + \gamma Q_1(s_{t+1}, \arg\max_{a'} Q_2(s_{t+1}, a')) - Q_2(s_t, a_t))$

### Characteristics of Double Q learning:
 - Double Q learning inherits all the characteristics of [Q learning](../7-q-learning/q_learning_agent.ipynb), except **maximization bias**.
 - It is an improvement of Q learning designed specifically to address maximization bias.

##### No maximization bias
We maintain two independent unbiased estimates, $Q_1$ and $Q_2$. When we use one to select the maximizing action, we use the other to estimate the value of that action.
Decoupling action selection from value estimation removes the positive bias that the max operator introduces when the same noisy estimate is used for both steps.
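
To see the effect of this decoupling in isolation, here is a minimal sketch outside of any environment. It assumes a hypothetical setting in which every action has a true value of 0, so the true maximum is also 0, and each estimate is the mean of a few noisy samples. The function name `compare_estimators` and all of its parameters are illustrative and not part of the agent.

```python
import numpy as np

def compare_estimators(n_actions=10, n_samples=5, n_trials=20000, seed=0):
    """Average single and double estimates of max_a Q(a) when every true Q(a) is 0."""
    rng = np.random.default_rng(seed)
    shape = (n_trials, n_actions, n_samples)
    # Two independent sets of noisy samples give two independent estimates.
    q1 = rng.normal(0.0, 1.0, size=shape).mean(axis=2)
    q2 = rng.normal(0.0, 1.0, size=shape).mean(axis=2)

    # Single estimator: q1 both selects the best action and evaluates it.
    single = q1.max(axis=1)
    # Double estimator: q1 selects the best action, q2 evaluates it.
    best_action = q1.argmax(axis=1)
    double = q2[np.arange(n_trials), best_action]
    return single.mean(), double.mean()

single_mean, double_mean = compare_estimators()
print(f"single estimator: {single_mean:.3f}, double estimator: {double_mean:.3f}")
```

Under these assumptions the single estimator should average clearly above 0, while the double estimator should stay close to 0, which is the true maximum.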

##### Significantly faster training than normal Q learning
In problems where maximization bias draws the agent toward suboptimal actions, double Q learning can significantly speed up training, because it stops overvaluing those actions more quickly than normal Q learning.

##### Initialization
For double Q learning we keep track of the following:
 - Instead of having one dictionary of Q values like in Q learning, in double Q learning we keep **two** dictionaries of Q values, `self.qA_table` and `self.qB_table`.
 Both Q value functions are initially set to $0$.
 - In order to showcase how robust off policy algorithms like Q learning are, we are going to keep the epsilon and learning rate constant.
   - `self.learning_rate` is set to $0.4$.
   - `self.epsilon` is set to $0.1$.
 - `self.discount_factor` is set to $0.9$.

In [None]:
import numpy as np
import random
from environment import Env
from collections import defaultdict

class QLearningAgent:
    def __init__(self, actions):
        # actions = [0, 1, 2, 3]
        self.actions = actions
        self.learning_rate = 0.4
        self.discount_factor = 0.9
        self.epsilon = 0.1
        self.qA_table = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
        self.qB_table = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])

### Double Q learning

Double Q learning is an off policy algorithm that combines sampling with bootstrapping.

Instead of having one dictionary of Q values like in Q learning, in double Q learning we keep **two** dictionaries of Q values.
The update rule for Q values in double Q learning is the following:
- $Q_1(s_t, a_t) \gets Q_1(s_t, a_t) + \alpha(r_t + \gamma Q_2(s_{t+1}, \arg\max_{a'} Q_1(s_{t+1}, a')) - Q_1(s_t, a_t))$
- $Q_2(s_t, a_t) \gets Q_2(s_t, a_t) + \alpha(r_t + \gamma Q_1(s_{t+1}, \arg\max_{a'} Q_2(s_{t+1}, a')) - Q_2(s_t, a_t))$

##### Action selection step via argmax operator
We use one of the Q tables to select the action with the maximum value in the next state, for example:
$\arg\max_{a'} Q_1(s_{t+1}, a')$

##### Q value estimation step
We then use the other Q table to *estimate* the value of that action:
$Q_2(s_{t+1}, \arg\max_{a'} Q_1(s_{t+1}, a'))$

##### Update step
We then update the Q value in the table that was used for action selection:
$Q_1(s_t, a_t) \gets Q_1(s_t, a_t) + \alpha(r_t + \gamma Q_2(s_{t+1}, \arg\max_{a'} Q_1(s_{t+1}, a')) - Q_1(s_t, a_t))$.
With $0.5$ probability we update $Q_1$ and with $0.5$ probability we update $Q_2$.


Simple as that. The terms in the update rule are:
 - $Q_x(s_t, a_t)$ - the Q value of the current state-action pair, stored in the table $Q_x$ that is being updated.
 - $Q_y(s_{t+1}, \arg\max_{a'} Q_x(s_{t+1}, a'))$ - the action selected by $Q_x$ in the next state is evaluated by the other table, $Q_y$.
 - $\alpha$ - the **learning rate**.
 The learning rate can take any value in the range $[0, 1]$.
 Values closer to 0 mean that we put more weight on older experiences, whereas values closer to 1 mean that we put more weight on the latest experiences.
 In our case, the learning rate takes the value $0.4$.
 - $r_t$ - the reward at time-step $t$.
 - $\gamma$ - the **discount factor**.
 Traditionally used when calculating returns, here it is used when calculating **expectancies of returns**, i.e. Q values.
 In our case, the discount factor takes the value $0.9$.
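
To make the role of the learning rate concrete, the sketch below repeatedly moves an estimate toward a fixed target. The fixed target of 10 and the helper name `step_toward` are assumptions made for illustration only. In the real algorithm the target changes from step to step.

```python
def step_toward(q, target, alpha):
    # Move the estimate a fraction alpha of the way toward the target.
    return q + alpha * (target - q)

target = 10.0  # hypothetical fixed TD target
for alpha in (0.0, 0.4, 1.0):
    q = 0.0
    history = []
    for _ in range(3):
        q = step_toward(q, target, alpha)
        history.append(round(q, 2))
    print(alpha, history)
```

With $\alpha = 0$ the estimate never moves, with $\alpha = 1$ it jumps straight to the latest target, and with $\alpha = 0.4$ it approaches the target gradually (4.0, 6.4, 7.84).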

The difference $r_t + \gamma Q_2(s_{t+1}, \arg\max_{a'} Q_1(s_{t+1}, a')) - Q_1(s_t, a_t)$ is commonly referred to as the **TD error**.

The sum $r_t + \gamma Q_2(s_{t+1}, \arg\max_{a'} Q_1(s_{t+1}, a'))$ is referred to as the **TD target**.
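
As a worked example of the TD target and TD error, the sketch below computes both for a single transition. The Q values, the state names `"s"` and `"s_next"`, and the helper `double_q_td` are hypothetical and chosen so that the two tables disagree about the best next action.

```python
def double_q_td(q_update, q_eval, state, action, reward, next_state, gamma=0.9):
    """Return (TD target, TD error) for the table being updated."""
    next_values = q_update[next_state]
    # Selection: greedy next action according to the table being updated.
    best_next = max(range(len(next_values)), key=next_values.__getitem__)
    # Evaluation: value of that action according to the other table.
    td_target = reward + gamma * q_eval[next_state][best_next]
    td_error = td_target - q_update[state][action]
    return td_target, td_error

# Hypothetical Q values for a two-action problem.
q1 = {"s": [0.0, 0.0], "s_next": [1.0, 3.0]}
q2 = {"s": [0.0, 0.0], "s_next": [2.0, 0.5]}

td_target, td_error = double_q_td(q1, q2, "s", 0, reward=1.0, next_state="s_next")
new_q1 = q1["s"][0] + 0.4 * td_error  # learning rate 0.4
print(td_target, td_error, new_q1)
```

Here $Q_1$ selects action 1 in the next state, but $Q_2$ values that action at only 0.5, so the TD target is $1 + 0.9 \cdot 0.5 = 1.45$. Standard Q learning on $Q_1$ alone would have used $1 + 0.9 \cdot 3 = 3.7$.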


In [None]:
class QLearningAgent(QLearningAgent):
    # update q function with sample <s, a, r, s'>
    def learn(self, state, action, reward, next_state):
        # randomly choose which table is updated; the other table evaluates
        if np.random.rand() < 0.5:
            q_update, q_eval = self.qA_table, self.qB_table
        else:
            q_update, q_eval = self.qB_table, self.qA_table

        current_q = q_update[state][action]
        # action selection: greedy next action according to the updated table
        best_next_action = self.arg_max(q_update[next_state])
        # value estimation: the other table evaluates the selected action
        td_target = reward + self.discount_factor * q_eval[next_state][best_next_action]
        td_error = td_target - current_q
        q_update[state][action] = current_q + self.learning_rate * td_error

### Other methods

##### Helper methods

In [None]:
class QLearningAgent(QLearningAgent):
    # get action for the state according to the q function table
    # agent pick action of epsilon-greedy policy
    def get_action(self, state):
        if np.random.rand() < self.epsilon:
            # take random action
            action = np.random.choice(self.actions)
        else:
            # take action according to the q function tables
            state_action_A = self.qA_table[state]
            state_action_B = self.qB_table[state]
            state_action_ABsum = [sum(x) for x in zip(state_action_A, state_action_B)]
            action = self.arg_max(state_action_ABsum)
        return action

In [None]:
class QLearningAgent(QLearningAgent):
    @staticmethod
    def arg_max(state_action):
        max_index_list = []
        max_value = state_action[0]
        for index, value in enumerate(state_action):
            if value > max_value:
                max_index_list.clear()
                max_value = value
                max_index_list.append(index)
            elif value == max_value:
                max_index_list.append(index)
        return random.choice(max_index_list)

##### Main loop

In [None]:
class QLearningAgent(QLearningAgent):
    def mainloop(self, env, verbose=False):
        for episode in range(1000):
            state = env.reset()

            while True:
                env.render()

                # take action and proceed one step in the environment
                action = self.get_action(str(state))
                next_state, reward, done = env.step(action)

                # with sample <s,a,r,s'>, agent learns new q function
                self.learn(str(state), action, reward, str(next_state))

                state = next_state
                env.print_value_all(self.qA_table, self.qB_table)

                # if episode ends, then break
                if done:
                    if verbose:
                        print("episode: ", episode,
                              "\tepsilon: ", round(self.epsilon, 2),
                              "\tlearning rate: ", round(self.learning_rate, 2)
                              )
                    break

In [None]:
if __name__ == "__main__":
    env = Env()
    agent = QLearningAgent(actions=list(range(env.n_actions)))
    try:
        agent.mainloop(env, verbose=False)
    except:
        pass

### Results

Double Q learning does converge to an optimal policy within 60 episodes.

Double Q learning, being an off policy algorithm, shows very robust results *even when the learning rates do not satisfy the Robbins-Monro conditions*.
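
For reference, the Robbins-Monro conditions on a sequence of learning rates $\alpha_t$ are usually stated as follows:

$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$

A constant learning rate such as $\alpha_t = 0.4$ satisfies the first condition but not the second, because $\sum_{t=1}^{\infty} 0.4^2$ diverges.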

In the image below you can see the Q values of the agent after 100 episodes.
The reason why there are Q values above 100, up to 200, is that double Q learning keeps two tables, and the value shown for each entry is the sum of the Q values from both tables.
The displayed values are therefore double the actual expectancies.

This does not affect our agent:
 - all the Q values have an upper bound, in this case twice the maximal reward of 100.
 - the Q values retain the same **true** ratio with each other, **without** a maximization bias like in Q learning.
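
The short check below illustrates why summing the two tables, as `get_action` does, leaves the greedy choice unchanged. The Q values are made up for illustration and do not come from the trained agent.

```python
import numpy as np

# Hypothetical Q values of one state in the two tables.
q_a = np.array([10.0, 40.0, 25.0, 5.0])
q_b = np.array([12.0, 38.0, 30.0, 4.0])

q_sum = q_a + q_b     # what the agent acts on
q_mean = q_sum / 2.0  # estimate on the original reward scale

# Scaling every entry by the same positive factor keeps the ordering intact,
# so the greedy action is the same for the sum and for the mean.
greedy_sum = int(np.argmax(q_sum))
greedy_mean = int(np.argmax(q_mean))
print(greedy_sum, greedy_mean)
```

Both calls pick the same action, so acting on the sum rather than the average changes only the scale of the displayed values.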

<h3 style="text-align:center">After 100 episodes</h3>
<img src="ipynb_results/double_q_learning_100_episodes.png" alt="double_q_learning_100_episodes.png" width="50%" />
