##### Symposium "Recent Advances in Deep Learning Systems", Reisensburg/UUlm, 05.11.2019 - 07.11.2019
##### Heinke Hihn, Institute for Neural Information Processing, UUlm

# Human-level control through deep reinforcement learning
## Deep Q-Networks
https://daiwk.github.io/assets/dqn.pdf

**Abstract**
>The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

## Deep Q-Learning
Q-learning (Sutton & Barto, 2018) is a reinforcement learning algorithm that learns the discounted future reward obtained by executing action $a$ in state $s$ and acting optimally afterwards. The corresponding Q-function $Q(s,a)$ is defined as

$$Q(s_t, a_t) = \max_\pi R_{t+1},$$

where $\pi$ is the agent's policy and $R_t$ is the cumulative reward from step $t$ onward:

$$R_t = r_t + r_{t+1} + ... + r_{n},$$

Because the horizon $n$ can be infinite, we introduce a discount factor $\gamma \in [0,1)$ so that the sum converges for bounded rewards:

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{n-t} r_{n},$$

which can be written recursively as

$$R_t = r_t + \gamma (r_{t+1} + \gamma (r_{t+2} + \dots)) = r_t + \gamma R_{t+1}.$$
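
The recursion $R_t = r_t + \gamma R_{t+1}$ can be checked numerically. The following is a minimal sketch in plain Python; the reward sequence and the function names are made up for illustration and are not part of the notebook's implementation.

```python
def discounted_return_direct(rewards, gamma):
    """Sum gamma**k * r_k explicitly over the whole reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def discounted_return_recursive(rewards, gamma):
    """Apply R_t = r_t + gamma * R_{t+1}, working backwards from the last step."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical reward sequence
gamma = 0.5

direct = discounted_return_direct(rewards, gamma)
recursive = discounted_return_recursive(rewards, gamma)
print(direct, recursive)
```

Both functions return the same value; the backward loop is the form that the Bellman equation builds on.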

A way to think of this is the following: the further away a reward is, the less we value it. In this formulation, an optimal strategy $\pi^*$ always chooses the action that maximizes the cumulative discounted future reward.
We can learn the optimal Q-function via the **Bellman Equation**:

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$

It can be summarized as follows: the maximum future reward for this state and action is the immediate reward plus the discounted **maximum** future reward for the next state.
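
To see the Bellman equation at work without a neural network, the sketch below repeatedly applies it as an update rule to a small Q-table. The three-state chain environment, its rewards and the function `step` are hypothetical and only serve as an illustration; the environment is deterministic and fully known, which is a much simpler setting than the Atari games in the paper.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2  # state 2 is terminal; action 0 = stay, action 1 = move right


def step(state, action):
    """Hypothetical deterministic chain: entering state 2 gives reward 1 and ends the episode."""
    next_state = min(state + action, 2)
    reward = 1.0 if next_state == 2 else 0.0
    done = next_state == 2
    return next_state, reward, done


# Q-table initialised to zero: one row per state, one column per action
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

for _ in range(50):
    for s in range(2):  # sweep over the non-terminal states only
        for a in range(N_ACTIONS):
            s_next, r, done = step(s, a)
            # Bellman backup: immediate reward plus discounted best value of the next state
            future = 0.0 if done else max(Q[s_next])
            Q[s][a] = r + GAMMA * future

for s in range(2):
    print(s, Q[s])
```

After a few sweeps the values stop changing, and in both states the action "move right" has the larger Q-value.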

We can learn the Q-function by parametrizing it with a neural network and minimizing the following regression loss (Mnih et al., 2015):
$$L=\frac{1}{2}\left(\underbrace{r + \gamma \max_{a'}Q(s',a')}_{\text{target}} - \underbrace{Q(s,a)}_{\text{prediction}}\right)^2$$

In practice, we minimize this loss by iterating the following steps:

1. Perform a feedforward pass for the current state $s$ to get the predicted Q-values.
2. Perform a second feedforward pass for the next state $s'$ and find the maximum over $Q(s', \cdot)$.
3. Set the Q-value target for action $a$ to $r + \gamma \max_{a'} Q(s',a')$. **Important**: for the remaining actions, set the target to the value returned in step 1, so that their error is 0 and they cause no update.
4. Update the weights using backpropagation on the regression loss.
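
Steps 1 to 3 can be sketched without any deep learning library. In the example below, the two lists `q_pred` and `q_next` stand in for the outputs of the two feedforward passes, and all numbers as well as the helper `build_targets` are hypothetical. Terminal states, for which the target would be the reward alone, are left out to keep the sketch short.

```python
def build_targets(q_pred, q_next, actions, rewards, gamma):
    """Return regression targets with the same shape as q_pred.

    q_pred: per-sample Q-values for the current state s (step 1)
    q_next: per-sample Q-values for the next state s' (step 2)
    """
    # copy the predictions, so that untouched actions have zero error
    targets = [list(row) for row in q_pred]
    for i, (a, r) in enumerate(zip(actions, rewards)):
        # step 3: overwrite only the entry of the action that was taken
        targets[i][a] = r + gamma * max(q_next[i])
    return targets


q_pred = [[0.2, 0.5], [1.0, -0.3]]
q_next = [[0.0, 1.0], [0.4, 0.1]]
targets = build_targets(q_pred, q_next, actions=[1, 0], rewards=[1.0, 0.0], gamma=0.9)

# per-sample loss 0.5 * (target - prediction)^2; step 4 would backpropagate this
loss = [
    0.5 * sum((t - q) ** 2 for t, q in zip(t_row, q_row))
    for t_row, q_row in zip(targets, q_pred)
]
print(targets, loss)
```

Only the entry of the executed action contributes to the loss, which is exactly the effect described in step 3.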

## Experience Replay
The most important trick for stabilizing learning in DQNs is experience replay. During gameplay, all experience tuples $(s,a,r,s')$ are stored in a replay memory. During training, we sample a batch of $N$ tuples uniformly at random from this memory and use that batch for the update. Consecutive transitions of an episode are strongly correlated; uniform sampling breaks this correlation, so the batch is much closer to independent and identically distributed (i.i.d.) data and the update can be treated as a simple supervised regression problem. There are several variants and improvements of experience replay, e.g. prioritized experience replay (Schaul et al., 2015) and hindsight experience replay (Andrychowicz et al., 2017).
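
A replay memory with uniform sampling needs very little code. The class below is a minimal sketch built on `collections.deque`; the class name `ReplayMemory` and its methods are hypothetical, and the stored tuples carry an extra `done` flag for terminal states.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # a deque with maxlen silently drops the oldest tuple once it is full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement breaks the temporal ordering
        batch = random.sample(list(self.buffer), batch_size)
        # regroup the list of tuples into one tuple per field
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=5)
for i in range(10):
    memory.push(state=i, action=0, reward=1.0, next_state=i + 1, done=False)

states, actions, rewards, next_states, dones = memory.sample(batch_size=3)
print(len(memory), states)
```

Because the capacity is 5, only the five most recent transitions remain available for sampling.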

## Target Network
We use two deep networks with parameters $\theta^-$ and $\theta$. The first one, the target network, supplies the Q-values used in the targets, while the second one, the online network, receives all updates during training. After a fixed number of updates, we set $\theta^- \leftarrow \theta$. The purpose is to hold the Q-value targets fixed temporarily so that we do not have a moving target to chase. In addition, parameter changes do not affect $\theta^-$ immediately, so even if the inputs are not perfectly i.i.d., their effect on the targets is not amplified. The Bellman loss becomes

$$L=\frac{1}{2}\left(\underbrace{r + \gamma \max_{a'}Q_{\theta^-}(s',a')}_{\text{target}} - \underbrace{Q_{\theta}(s,a)}_{\text{prediction}}\right)^2,$$

where $Q_{\theta^-}$ is the Q-function given by the target network and $Q_{\theta}$ the Q-function given by the DQN.
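
The periodic copy $\theta^- \leftarrow \theta$ can be illustrated with a toy loop. In this sketch the parameters are plain Python lists, the "gradient step" is a stand-in that just adds 1 to every weight, and the interval `SYNC_EVERY` is a hypothetical value; none of this is taken from the paper's settings.

```python
import copy

SYNC_EVERY = 4  # hypothetical number of updates between two copies

theta = {"w": [0.0, 0.0]}            # online parameters, changed at every step
theta_minus = copy.deepcopy(theta)   # target parameters, frozen between copies

history = []
for step in range(1, 11):
    # stand-in for one gradient step on the online network
    theta["w"] = [w + 1.0 for w in theta["w"]]

    if step % SYNC_EVERY == 0:
        # hard update: theta^- <- theta
        theta_minus = copy.deepcopy(theta)

    history.append((step, theta["w"][0], theta_minus["w"][0]))

for row in history:
    print(row)
```

The third column only changes at steps 4 and 8, so between two copies the targets are computed from a fixed set of parameters.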

### References 
**(Schaul et al., 2015)** Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.

**(Andrychowicz et al., 2017)** Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., ... & Zaremba, W. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems (pp. 5048-5058).

**(Mnih et al., 2015)** Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.

**(Sutton & Barto, 2018)** Sutton, R. S., & Barto, A. G. (2018). *Reinforcement learning: An introduction*. MIT Press.

# Implementation

In [None]:
%tensorflow_version 2.x
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers, losses
from tensorflow.keras import Model
import numpy as np
import random
from collections import deque
import gym

## DQN Model

In [None]:
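class DQNModel(Model):
    """Two hidden layers with 64 ReLU units each, followed by one linear output per action."""
    def __init__(self):
        super(DQNModel, self).__init__()
        self.layer1 = Dense(64, activation='relu')
        self.layer2 = Dense(64, activation='relu')
        # num_action is a global that must be set from the environment before the model is created
        self.value = Dense(num_action)

    def call(self, state):
        layer1 = self.layer1(state)
        layer2 = self.layer2(layer1)
        # linear output: one Q-value estimate per action
        value = self.value(layer2)
        return value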

## Training

In [1]:
class Trainer:
    def __init__(self):
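        # hyper parameters
        self.lr = 0.001
        self.gamma = 0.99

        # create model and target model
        self.dqn_model = DQNModel()
        self.dqn_target = DQNModel()
        self.opt = optimizers.Adam(learning_rate=self.lr)

        # run one dummy forward pass so that both networks create their weights,
        # then start the target network as an exact copy of the online network
        dummy = tf.zeros((1, state_size))
        self.dqn_model(dummy)
        self.dqn_target(dummy)
        self.dqn_target.set_weights(self.dqn_model.get_weights())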

        # epsilon-greedy action selection
        # with decaying epsilon
        self.epsilon = 1.0
        self.epsilon_decay = 0.999
        self.epsilon_min = 0.01
        self.batch_size = 64
        # we start training after we have at least 1000 tuples in the replay memory
        self.train_start = 1000
        self.state_size = state_size
        # replay memory
        self.memory = deque(maxlen=2000)

    def update_target(self):
        """
        Updates the target network
        :return: none
        """
        self.dqn_target.set_weights(self.dqn_model.get_weights())

    def get_action(self, state):
        """
        Select an action given a state. Implements epsilon-greedy strategy
        :param state: state input
        :return: selected action
        """
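        if np.random.rand() <= self.epsilon:
            # explore: pick a random action
            return random.randrange(num_action)
        else:
            # exploit: pick the action with the highest predicted Q-value
            state = np.reshape(state, [1, -1])
            q_value = self.dqn_model(tf.convert_to_tensor(state, dtype=tf.float32))
            return int(np.argmax(q_value[0]))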

    def append_tuple(self, state, action, reward, next_state, done):
        """
        Append a tuple to experience buffer
        :param state: observed state
        :param action: executed action
        :param reward: reward received
        :param next_state: state reached after executing the action in the observed state
        :param done: whether or not the environment reached a terminal state
        :return: none
        """
        self.memory.append((state, action, reward, next_state, done))

    def train(self):
        """
        Trains the DQN model.
        :return: none
        """
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

        # experience replay
        mini_batch = random.sample(self.memory, self.batch_size)

        states = np.zeros((self.batch_size, self.state_size))
        next_states = np.zeros((self.batch_size, self.state_size))
        actions, rewards, dones = [], [], []
        # split batch
        for i in range(self.batch_size):
            states[i] = mini_batch[i][0]
            actions.append(mini_batch[i][1])
            rewards.append(mini_batch[i][2])
            next_states[i] = mini_batch[i][3]
            dones.append(mini_batch[i][4])

        states_t = tf.convert_to_tensor(states, dtype=tf.float32)
        next_states_t = tf.convert_to_tensor(next_states, dtype=tf.float32)

        # build the regression targets outside the tape: no gradient flows through them
        target = np.array(self.dqn_model(states_t))
        target_val = np.array(self.dqn_target(next_states_t))

        for i in range(self.batch_size):
            # q-learning
            if dones[i]:
                # if terminal state just take the reward
                target[i][actions[i]] = rewards[i]
            else:
                # if not terminal state we also take discounted future reward into account
                target[i][actions[i]] = rewards[i] + self.gamma * target_val[i].max()

        # compute gradients and update model parameters
        with tf.GradientTape() as tape:
            values = self.dqn_model(states_t)
            error = tf.reduce_mean(0.5 * tf.square(values - target))

        # fetch the variables after the forward pass, so the model is guaranteed to be built
        dqn_variable = self.dqn_model.trainable_variables
        dqn_grads = tape.gradient(error, dqn_variable)
        self.opt.apply_gradients(zip(dqn_grads, dqn_variable))

    def run(self, env):
        """
        Runs the training loop. Uses the classic gym API, where reset() returns the
        observation and step() returns (observation, reward, done, info).
        :param env: gym environment
        :return: none
        """
        max_ep_len = 500
        episodes = 1000

        for e in range(episodes):
            # start every episode from a fresh initial state
            state = env.reset()
            state = np.reshape(state, [1, state_size])
            total_reward = 0

            for t in range(max_ep_len):
                action = self.get_action(state)
                next_state, reward, done, info = env.step(action)
                next_state = np.reshape(next_state, [1, state_size])

                #env.render()

                # force termination on the last allowed step
                if t == max_ep_len - 1:
                    done = True
                # penalize episodes that end before the step limit
                elif done:
                    reward = -1

                total_reward += reward
                self.append_tuple(state, action, reward, next_state, done)

                # train if memory has enough tuples
                if len(self.memory) >= self.train_start:
                    self.train()

                state = next_state

                if done:
                    # sync the target network at the end of every episode
                    self.update_target()
                    print("e : ", e, " reward : ", total_reward, " step : ", t)
                    break
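
The `Trainer` multiplies epsilon by `epsilon_decay` on every call to `train()` until it falls to `epsilon_min`. The sketch below, using the same three values (1.0, 0.999 and 0.01), estimates how many training updates this takes; the helper `epsilon_after` is hypothetical and only mirrors that decay rule.

```python
import math


def epsilon_after(updates, start=1.0, decay=0.999, floor=0.01):
    """Replay the decay rule: multiply by decay as long as epsilon is above the floor."""
    eps = start
    for _ in range(updates):
        if eps > floor:
            eps *= decay
    return eps


# closed form: smallest n with start * decay**n <= floor
updates_to_floor = math.ceil(math.log(0.01 / 1.0) / math.log(0.999))

print(updates_to_floor)
print(epsilon_after(1000), epsilon_after(5000))
```

With these settings the agent keeps exploring noticeably for several thousand updates, which is worth keeping in mind when choosing the decay for environments with longer episodes.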

## Putting it all together

In [None]:
env = gym.make('CartPole-v0')
num_action = env.action_space.n
state_size = env.observation_space.shape[0]

DQN = Trainer()
DQN.run(env)

# Hackathon Project
Implement a DQN agent that learns to solve the Pong game (`gym.make("Pong-v0")`).