## Double deep Q networks
Double deep Q-networks implement the [double Q-learning](../8-double-q-learning/double_q_learning_agent.ipynb) algorithm, using a neural network to approximate the Q values.
To prevent overestimation and reduce maximization bias, DDQN decouples the action selection from Q value evaluation.
In a DDQN there are two neural networks at play, one current network and one older version of that network.
The first network is always improving and is used to select the action, meaning that it uses the maximization operator over the Q values to select the best action.
The older network is used to evaluate the Q value of that action.
Finally, every now and then the old network is set equal to the current network, meaning that the old network gets updated to a better state, but less frequently than the current network.
DDQN has been shown to reduce overestimation bias and improve performance on the same set of problems that DQN is used for.
DDQN can be further improved using an experience replay buffer.

### Characteristics of double deep Q network
Double deep Q network inherits all the characteristics of [double Q learning](../8-double-q-learning/double_q_learning_agent.ipynb), except that it allows us to solve problems with continuous or very large state and action spaces.

##### Neural Network as value function approximator
This can be any neural network architecture that we see fit.
In this example, we are using a very simple Dense Neural Network architecture.
Dense Neural Networks are simply a *stack of matrices with activation functions in between*.

##### Replay memory
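Each tuple of $(s_t, a_t, r, s_{t+1})$ is recorded in a replay memory, together with a `done` flag that marks whether the episode ended at that step.
As a replay memory we use a FIFO deque of constant size: once it is full, the oldest tuple is dropped to make room for the newest one.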
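
The eviction behaviour can be checked with a minimal sketch. The capacity of 3 and the transition values are made up for illustration; the notebook itself uses `maxlen=2000`.

```python
from collections import deque

# A tiny replay memory; the capacity of 3 is only for illustration.
memory = deque(maxlen=3)

# Store hypothetical (state, action, reward, next_state, done) tuples.
for step in range(5):
    memory.append((step, 0, 1.0, step + 1, False))

# Only the 3 most recent transitions remain; the oldest were evicted first.
oldest_state = memory[0][0]
newest_state = memory[-1][0]
print(len(memory), oldest_state, newest_state)
```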

##### Batch training
At each step taken by the agent, we feed a *random* sample of tuples from the replay memory to the neural network in order to do *batch training*.
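
As a rough sketch of the sampling step, the snippet below draws a mini-batch from a replay memory and regroups it field by field. The transitions are hypothetical placeholders and the batch size of 8 is an assumption chosen to keep the example small.

```python
import random
from collections import deque

random.seed(0)  # fixed seed so the sketch is reproducible

memory = deque(maxlen=2000)
for step in range(100):
    # hypothetical transition: (state, action, reward, next_state, done)
    memory.append((step, step % 2, 1.0, step + 1, False))

batch_size = 8
# Sample without replacement, so the batch mixes transitions from
# different points in time instead of consecutive, correlated steps.
mini_batch = random.sample(memory, batch_size)

# Regroup the list of tuples into one sequence per field.
states, actions, rewards, next_states, dones = zip(*mini_batch)
print(states)
```

Sampling at random rather than taking the latest transitions is what breaks the correlation between consecutive steps of an episode.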

##### Main model and Target model
We keep two neural network models in our implementation.
The reason we keep a second one, called the target model, is to provide some **stability** while training.
 - with each step taken we update only the first main model.
 - after some time interval (in this case, every episode), we update the second target model to be the same as the main model.
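
The update schedule described above can be sketched without Keras. `TinyModel` is a hypothetical stand-in that only mimics the `get_weights` / `set_weights` pair; the "training step" is a placeholder that nudges the weights.

```python
import copy


class TinyModel:
    """Hypothetical stand-in for a Keras model: just a list of weights."""

    def __init__(self, weights):
        self.weights = list(weights)

    def get_weights(self):
        return copy.deepcopy(self.weights)

    def set_weights(self, weights):
        self.weights = copy.deepcopy(weights)


model = TinyModel([0.0, 0.0])
target_model = TinyModel([0.0, 0.0])

history = []
for episode in range(2):
    for step in range(3):
        # Placeholder training step: only the main model changes.
        model.weights = [w + 0.1 for w in model.weights]
    # Snapshot just before the sync: the target still lags behind.
    history.append((model.get_weights(), target_model.get_weights()))
    # End of episode: hard update, copy main weights into the target.
    target_model.set_weights(model.get_weights())

print(history)
```

Between two syncs the target model stays frozen, so the values the main model is trained towards do not shift with every gradient step.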

##### Continuous or very large state and action spaces
Using a neural network as a value function approximator has its benefits.
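It allows us to solve problems with continuous or very large state spaces; because the network outputs one Q value per action, the actions themselves still have to form a discrete set.
Such problems would be almost impossible to solve with tabular Q learning, which needs a separate table entry for every state-action pair.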

##### Decoupling action selection and evaluation
This is the main difference with [deep Q network](../11-dqn/cartpole_dqn.ipynb).
As in a deep Q network, in a double deep Q network we maintain two neural networks: one main model and one target model.
In a double deep Q network, the selection of the best action is done by the current model, whereas the evaluation of that action is done by the target model.
This has been shown to reduce maximization bias.
```
target = self.model.predict(curr_states)
target_next = self.model.predict(next_states)
target_val = self.target_model.predict(next_states)

for i in range(self.batch_size):
    # terminal transition: there is no next state to bootstrap from,
    # so the target is just the reward
    if done[i]:
        target[i][action[i]] = reward[i]
    else:
        # the key point of Double DQN
        # selection of best action is from model
        a = np.argmax(target_next[i])
        # evaluation is from target model
        target[i][action[i]] = reward[i] + self.discount_factor * (
            target_val[i][a])
```
By contrast, in deep Q network, selection of best action and evaluation of that action are both done in the target model.
```
target = self.model.predict(curr_states)
target_val = self.target_model.predict(next_states)
for i in range(self.batch_size):
    # Q Learning: get maximum Q value at s' from target model
    if done[i]:
        target[i][action[i]] = reward[i]
    else:
        # selection of best action is from *target* model
        # evaluation is also from target model
        target[i][action[i]] = reward[i] + self.discount_factor * (
            np.amax(target_val[i]))
```
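
To get a feel for why the decoupling matters, here is a small simulation under simplifying assumptions: every action has a true value of 0, and each estimate is corrupted by independent unit Gaussian noise. The names `noisy_estimates`, `N_ACTIONS` and `N_TRIALS` are made up for this sketch.

```python
import random

random.seed(0)

N_ACTIONS = 10
N_TRIALS = 10000


def noisy_estimates():
    # True Q value of every action is 0; each estimate has unit Gaussian noise.
    return [random.gauss(0.0, 1.0) for _ in range(N_ACTIONS)]


single_total = 0.0
double_total = 0.0
for _ in range(N_TRIALS):
    q_a = noisy_estimates()  # plays the role of the estimate that selects
    q_b = noisy_estimates()  # independent estimate that evaluates
    # Single estimator: select and evaluate with the same noisy values.
    single_total += max(q_a)
    # Double estimator: select with q_a, evaluate the chosen action with q_b.
    best = max(range(N_ACTIONS), key=lambda a: q_a[a])
    double_total += q_b[best]

single_mean = single_total / N_TRIALS
double_mean = double_total / N_TRIALS
print(f"single: {single_mean:.3f}  double: {double_mean:.3f}")
```

The single estimator reports a value well above the true 0, while the double estimator stays close to it. In DDQN the two networks are not fully independent, since the target model is a delayed copy of the main model, so the bias is reduced rather than removed.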

##### Initialization
For the double Q learning aspect we keep track of the following:
 - In order to showcase how robust off policy algorithms like Q learning are, we keep the learning rate constant and decay epsilon with a simple fixed schedule
   - `self.learning_rate` is set to $0.001$ (`self.learning_rate_decay` is $0.0$)
   - `self.epsilon` is initially set to $1.0$ and is multiplied by `self.epsilon_decay = 0.999` after each step taken, until it falls below the minimum `self.epsilon_min = 0.01`
 - `self.discount_factor` is set to $0.99$
 - We keep track of a **replay memory** of size $2000$ via the following `self.memory = deque(maxlen=2000)`
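
As a quick sanity check of the epsilon schedule listed above, this sketch counts how many steps the multiplicative decay needs before exploration bottoms out. It reuses the same constants and the same stopping rule as the notebook's `append_sample`; the variable names `steps` and `closed_form` are only for this example.

```python
import math

epsilon = 1.0
epsilon_decay = 0.999
epsilon_min = 0.01

steps = 0
# Same rule as the notebook: keep decaying while above the minimum.
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    steps += 1

# Closed form: smallest n with epsilon_decay ** n <= epsilon_min.
closed_form = math.ceil(math.log(epsilon_min) / math.log(epsilon_decay))
print(steps, closed_form)
```

Both numbers come out at roughly 4,600 environment steps, so the agent is already acting almost greedily after a few thousand steps.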

For the neural network aspect we keep track of the following:
 - we keep track of a model (neural network) in `self.model = self.build_model()`
 - we also keep track of a target model `self.target_model = self.build_model()`.
 After some time interval we update the target model to be the same as the main model to provide some stability.
 Also note that here, the target network is used to only *evaluate* the selected actions.
 - we only train after there are at least $1000$ entries in the replay memory.
 We specify this at `self.train_start = 1000`.

In [None]:
import sys
import gym
import pylab
import random
import numpy as np
from collections import deque
from keras.layers import Dense
from keras.optimizers import Adam
from keras.models import Sequential

EPISODES = 300


# Double DQN Agent for the Cartpole
# it uses Neural Network to approximate q function
# and replay memory & target q network
class DoubleDQNAgent:
    def __init__(self, state_size, action_size):
        # if you want to see Cartpole learning, then change to True
        self.render = False
        self.load_model = False
        # get size of state and action
        self.state_size = state_size
        self.action_size = action_size

        # these are the hyperparameters for the Double DQN
        self.discount_factor = 0.99
        self.learning_rate = 0.001
        self.learning_rate_decay = 0.0
        self.epsilon = 1.0
        self.epsilon_decay = 0.999
        self.epsilon_min = 0.01
        self.batch_size = 64
        self.train_start = 1000
        # create replay memory using deque
        # contains <s,a,r,s'> tuples
        self.memory = deque(maxlen=2000)

        # create main model and target model
        self.model = self.build_model()
        self.target_model = self.build_model()

        # initialize target model
        self.next_states_model()

        if self.load_model:
            self.model.load_weights("./save_model/cartpole_ddqn.h5")

### Double deep Q network

The update rule for Q values in double deep Q network follows the same logic as in [double Q learning](../8-double-q-learning/double_q_learning_agent.ipynb):

$\hat{Q}^\pi(s_t, a_t) \gets^{train} R_t + \gamma \hat{Q}^\pi_{target}(s_{t+1}, \operatorname{argmax}_{a'} \hat{Q}_{current}^\pi(s_{t+1}, a'))$
 - $\hat{Q}^\pi(s_t, a_t)$ - *predicted* Q value of current state-action pair following the policy $\pi$
 - $\gets^{train}$ - this means train the neural network towards this value, instead of *assigning* the value
 - $R_t$ - the reward received after taking action $a_t$ in state $s_t$
 - $\gamma$ - the **discount factor**.
 - $\operatorname{argmax}_{a'} \hat{Q}_{current}^\pi(s_{t+1}, a')$ - this is the *action selection step*, done by using the current network model:
 the argmax operator over the **predicted** Q values of all possible actions in the next state.
 - $\hat{Q}^\pi_{target}(...)$ - this is the *action evaluation* step, done by using the target network model on the selected action.
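
A worked example for a single transition may help here. The Q values below are invented for illustration, and the two lists stand in for the outputs of the current and target networks on the next state.

```python
gamma = 0.99
reward = 1.0

# Hypothetical Q values for the next state s_{t+1}, two actions.
q_current_next = [1.0, 2.0]  # from the current (main) network
q_target_next = [3.0, 0.5]   # from the target network

# Double DQN: select with the current network, evaluate with the target network.
best_action = max(range(len(q_current_next)), key=lambda a: q_current_next[a])
ddqn_target = reward + gamma * q_target_next[best_action]

# Plain DQN: select and evaluate with the target network.
dqn_target = reward + gamma * max(q_target_next)

print(best_action, ddqn_target, dqn_target)
```

The current network prefers action 1, so the double DQN target uses the target network's value for action 1 (0.5), whereas plain DQN takes the target network's own maximum (3.0) and ends up with a much larger target.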

First things first, the update formula is very similar to the update rule we saw in double Q learning, although with the following differences:
 - instead of $Q^\pi$ we now deal with $\hat{Q}^\pi$, which is an approximation, the output of a neural network.
 - after calculating the result, we do not *assign* the value to $\hat{Q}^\pi$, but rather *train* the neural network with *gradient descent* in order to update the weights with the latest state-action value.

Moreover, this seems like a simplified version of the update rule we saw for double Q learning.
Notice that the right hand side of the formula is simply a **bootstrapped return**.
We use this value to update the network and we do not take into account the temporal difference between Q values.

The reason behind this is that in double deep Q network, training the neural network mimics the temporal difference formula we saw in Q learning with a *learning rate*.
In a neural network we update the weights via *backpropagation* in *gradient descent*.
We also specify a **learning rate** in the process.
That is why we do not need the explicit temporal difference aspect of Q learning anymore, since a similar implicit process is provided by backpropagation when we train the neural network with new data.
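
The analogy can be made concrete in the simplest possible case, assuming the "network" is a single trainable number `q` and the loss is half the squared error. This is only a sketch: a real network shares weights across states, and the Adam optimizer used in this notebook rescales the step.

```python
learning_rate = 0.1
q = 2.0        # current prediction for Q(s_t, a_t)
target = 5.0   # bootstrapped return computed from the target network

# Tabular-style temporal difference update.
q_td = q + learning_rate * (target - q)

# One gradient descent step on the loss 0.5 * (target - q) ** 2,
# treating q itself as the only trainable parameter.
gradient = -(target - q)
q_sgd = q - learning_rate * gradient

print(q_td, q_sgd)
```

Both updates move the prediction by the same amount towards the target, which is why the temporal difference term does not need to appear explicitly in the training target.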

Moreover, keep in mind these three characteristics of training that we also mentioned above:
 - replay memory
 - batch training
 - stability with target networks

In [None]:
class DoubleDQNAgent(DoubleDQNAgent):
    # pick samples randomly from replay memory (with batch_size)
    def train_model(self):
        if len(self.memory) < self.train_start:
            return
        batch_size = min(self.batch_size, len(self.memory))
        mini_batch = random.sample(self.memory, batch_size)

        # get s (state) as input from mini_batch
        # initialize with shape batch_size x state_size
        curr_states = np.zeros((batch_size, self.state_size))
        # get s' (next state) as input from mini_batch
        # initialize with shape batch_size x state_size
        next_states = np.zeros((batch_size, self.state_size))
        action, reward, done = [], [], []

        for i in range(batch_size):
            # get s (state) as input from mini_batch
            curr_states[i] = mini_batch[i][0]
            action.append(mini_batch[i][1])
            reward.append(mini_batch[i][2])
            # get s' (next state) as input from mini_batch
            next_states[i] = mini_batch[i][3]
            done.append(mini_batch[i][4])

        target = self.model.predict(curr_states)
        target_next = self.model.predict(next_states)
        target_val = self.target_model.predict(next_states)

        for i in range(batch_size):
            # terminal transition: there is no next state to bootstrap from,
            # so the target is just the reward
            if done[i]:
                target[i][action[i]] = reward[i]
            else:
                # the key point of Double DQN
                # selection of best action is from model
                a = np.argmax(target_next[i])
                # evaluation is from target model
                target[i][action[i]] = reward[i] + self.discount_factor * (
                    target_val[i][a])

        # fit the main model on the updated targets; only the entry of the
        # action actually taken differs from the current prediction
        self.model.fit(curr_states, target, batch_size=batch_size,
                       epochs=1, verbose=0)

##### The model: neural network
Neural networks are built one layer after the other.
Our model is a dense neural network, i.e. a neural network composed of only dense layers.
A dense layer is simply a *weight matrix* (plus a bias vector) followed by an *activation function*.
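
To make that description concrete, here is a minimal forward pass of one dense layer in plain Python. The helper names `relu` and `dense` and all the numbers are assumptions for this sketch; Keras performs the same kind of computation internally with optimized tensor operations.

```python
def relu(values):
    # Rectified linear unit applied element-wise.
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases, activation):
    # weights[j] holds the incoming weights of output unit j.
    pre_activation = [
        sum(w * x for w, x in zip(unit_weights, inputs)) + b
        for unit_weights, b in zip(weights, biases)
    ]
    return activation(pre_activation)


# Hypothetical layer with 2 inputs and 2 units.
state = [1.0, 2.0]
weights = [[0.5, 0.25], [-2.0, 0.5]]
biases = [0.0, 0.5]

hidden = dense(state, weights, biases, relu)
print(hidden)
```

The first unit computes 0.5 + 0.5 = 1.0 and passes through unchanged; the second computes -2.0 + 1.0 + 0.5 = -0.5 and is clipped to 0 by the ReLU.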

In [None]:
class DoubleDQNAgent(DoubleDQNAgent):
    # approximate Q function using Neural Network
    # state is input and Q Value of each action is output of network
    def build_model(self):
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu',
                        kernel_initializer='he_uniform'))
        model.add(Dense(24, activation='relu',
                        kernel_initializer='he_uniform'))
        model.add(Dense(self.action_size, activation='linear',
                        kernel_initializer='he_uniform'))
        model.summary()
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate, decay=self.learning_rate_decay))
        return model

##### Setting target network for stability

In [None]:
class DoubleDQNAgent(DoubleDQNAgent):
    # after some time interval update the target model to be same with model
    def next_states_model(self):
        self.target_model.set_weights(self.model.get_weights())

##### Helper methods

In [None]:
class DoubleDQNAgent(DoubleDQNAgent):
    # get action from model using epsilon-greedy policy
    def get_action(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        else:
            q_value = self.model.predict(state)
            return np.argmax(q_value[0])

In [None]:
class DoubleDQNAgent(DoubleDQNAgent):
    # save sample <s,a,r,s'> to the replay memory
    def append_sample(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

##### Main loop

In [None]:
if __name__ == "__main__":
    # in CartPole-v1 an episode lasts at most 500 time steps
    env = gym.make('CartPole-v1')
    # get size of state and action from environment
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n

    agent = DoubleDQNAgent(state_size, action_size)

    scores, episodes = [], []

    for e in range(EPISODES):
        done = False
        score = 0
        state = env.reset()
        state = np.reshape(state, [1, state_size])

        while not done:
            if agent.render:
                env.render()

            # get action for the current state and go one step in environment
            action = agent.get_action(state)
            next_state, reward, done, info = env.step(action)
            next_state = np.reshape(next_state, [1, state_size])
            # if an action ends the episode before step 500, give a penalty of -100
            reward = reward if not done or score == 499 else -100

            # save the sample <s, a, r, s'> to the replay memory
            agent.append_sample(state, action, reward, next_state, done)
            # every time step do the training
            agent.train_model()
            score += reward
            state = next_state

            if done:
                # every episode update the target model to be same with model
                agent.next_states_model()

                # every episode, plot the play time
                score = score if score == 500 else score + 100
                scores.append(score)
                episodes.append(e)
                pylab.plot(episodes, scores, 'b')
                pylab.savefig("./save_graph/cartpole_ddqn2.png")
                print("episode:", e, "  score:", score, "  memory length:",
                      len(agent.memory), "  epsilon:", agent.epsilon)

                # if the mean score of the last 10 episodes is above 490,
                # stop training
                if np.mean(scores[-min(10, len(scores)):]) > 490:
                    sys.exit()

        # save the model weights every 50 episodes
        if e % 50 == 0:
            agent.model.save_weights("./save_model/cartpole_ddqn2.h5")

<br/>
<h3 style="text-align:center">Results</h3>
<img src="./save_graph/cartpole_ddqn.png" alt="cartpole_ddqn.png" width="70%" />

We can see that the graph of DDQN does not differ too much from the graph shown in [DQN](../11-dqn/cartpole_dqn.ipynb).
There is clearly less maximization bias, although the fact that we start training after 80 episodes means that some difference between the upper and lower bounds of the scores in the beginning is normal.

We can see that the lower bounds converge at almost the same rate as the lower bounds of DQN.
Hitting the 500 mark is also less frequent, since in DDQN there is less maximization bias.

Nevertheless, we can clearly see that maximization bias is not counterproductive in the case of the CartPole-v1 environment.