# Deep Q-Network (DQN)

The original paper, published in Nature, can be found [here](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf). Two excellent learning resources for it are available from [@Hvass-Labs](https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/16_Reinforcement_Learning.ipynb) and [@dennybritz](https://github.com/ZhiruiFeng/reinforcement-learning/tree/master/DQN).

As its creator, DeepMind, puts it, the significance of this work is:
> This work represents the first demonstration of a general-purpose agent that is able to continually adapt its behavior without any human intervention, a major technical step forward in the quest for general AI.

<img src="./resource/images/dqn-1.png" alt="Drawing" style="width: 800px;"/>

The figure above is borrowed from a [Nature News & Views article](https://www.nature.com/articles/518486a.epdf?referrer_access_token=rJi2LNPaO_wh7LCXE8J0gNRgN0jAjWel9jnR3ZoTv0M4DtkukdMkIcR-UVrz0pNp311MkppKL7NysMmwcju-Md7bwkauG8hqmn4c75o_6pA%3D&tracking_referrer=http%3A%2F%2Fwww.nature.com%2Fnews%2Fnewsandviews) that discusses DQN.

DQN has since become the scaffold of deep reinforcement learning and provides the core architecture for later improvements, so anyone who wants to work on RL should take a deep and thorough look at this model.

Within the **RL-League** project, you will find that nearly all of the more complex models are extensions of this one.

## What is DQN?

We can view DQN as the Q-learning algorithm with a neural network playing the role of the action-value function approximator, which enables the agent to learn to choose actions from high-dimensional raw-image input. In Q-learning, $Q^*(s,a)$ represents the accumulated future reward obtained if, in state $s$, the system first performs action $a$ and subsequently follows an optimal policy. For a small state space, tabular RL uses a look-up table to store and update $Q(s,a)$, which fails when the state space becomes very large. DQN therefore approximates $Q^*$ with an artificial neural network, a function approximator that generalizes across states.
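
To make the tabular case concrete, here is a minimal sketch of a single Q-learning update on a look-up table. The function name `q_learning_update`, the learning rate `alpha`, and the toy sizes are illustrative assumptions and are not part of this project's modules.

```python
import numpy as np

def q_learning_update(q_table, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrapped target: immediate reward plus discounted best next value.
    target = r if done else r + gamma * np.max(q_table[s_next])
    td_error = target - q_table[s, a]
    q_table[s, a] += alpha * td_error
    return td_error

# Toy problem with 3 states and 2 actions (sizes chosen only for illustration).
q_table = np.zeros((3, 2))
td = q_learning_update(q_table, s=0, a=1, r=1.0, s_next=2, done=False)
```

The table needs one entry per state-action pair, which is exactly what becomes infeasible for raw-image states and motivates replacing `q_table` with a neural network.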

The loss function for the function approximator, the deep Q-network, is: $$\mathcal{L}_i(w_i) = \mathbb{E}_{s,a,r,s'\thicksim\mathcal{D}_i}\left[\left(r+\gamma \max_{a'}Q(s',a';w_i^-)-Q(s,a;w_i)\right)^2\right]$$

In this equation:
- $i$ indexes the training iteration (time);
- $\mathcal{D}_i$ is the replay buffer used for [experience replay](https://medium.com/@Fihezro/experience-replay-deep-reinforcement-learning-ef02b8a1383);
- $s'$ is the next state after taking action $a$ in state $s$; note that in a dynamic (stochastic) environment, $s'$ may not be fully determined by $s$ and $a$;
- $w_i^-$ denotes the parameters of the network that computes the Q-learning target; they are updated only infrequently, and when they are updated they are set to $w_i$, the parameters of the Q-network;
- $\gamma$ is the discount factor;
- $r+\gamma \max_{a'}Q(s',a';w_i^-)$ plays the role of the 'ground truth', assuming that later actions are chosen optimally;
- $Q(s,a;w_i)$ is the estimate produced by the Q-network, and $w_i$ denotes the parameters updated through gradient descent.
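
The loss above can be sketched with NumPy for a single minibatch sampled from the replay buffer. This is only an illustration under stated assumptions: the function name `dqn_loss`, the array layout, and the `dones` mask (which switches off bootstrapping at terminal states and does not appear in the equation) are hypothetical, and this is not how the `NeuralNetwork` module in this project computes its loss.

```python
import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error over a minibatch.

    q_online:      Q(s, .; w)    with shape (batch, num_actions)
    q_target_next: Q(s', .; w^-) with shape (batch, num_actions)
    """
    # max_a' Q(s', a'; w^-), with no bootstrapping past terminal states
    # (the terminal mask is an assumption, not part of the equation).
    max_next = q_target_next.max(axis=1) * (1.0 - dones)
    targets = rewards + gamma * max_next
    # Q(s, a; w) for the actions that were actually taken.
    predicted = q_online[np.arange(len(actions)), actions]
    return np.mean((targets - predicted) ** 2)

# Tiny hand-made batch of two transitions (values chosen only for illustration).
q_online = np.array([[1.0, 2.0], [0.5, 0.0]])
q_target_next = np.array([[0.0, 3.0], [4.0, 1.0]])
actions = np.array([1, 0])
rewards = np.array([1.0, -1.0])
dones = np.array([0.0, 1.0])
loss = dqn_loss(q_online, q_target_next, actions, rewards, dones, gamma=0.5)
```

In a real agent, `q_target_next` would come from a separate copy of the network holding $w_i^-$, and gradients would flow only through `q_online`.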

## Structure of this notebook

As mentioned in the introduction of this project, we take a modular view of these RL models; you can find the reasoning [here](http://fzruniverse.life/2018/03/24/Modular-Architecture-for-Implementing-RL-Agent/).

<img src="./resource/images/dqn-2.png" alt="Drawing" style="width: 400px;"/>

Modules are implemented in `/agents/modules/`; here we just import these classes and combine them into a DQN agent. An encapsulated version of the DQN agent can be found under `/agents/`. You can use it in `/field/`, where the agents solve problems in virtual game simulators.

## Implementation

In [None]:
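import gym
from gym.wrappers import Monitor
import numpy as np
import sys
import os
import time  # Used by Agent.run() to slow down rendering.

if "../agents/modules" not in sys.path:
    sys.path.append("../agents/modules")
# NOTE: LinearControlSignal is used by Agent below but was not imported in the
# original cell; importing it from `accessory` is an assumption.
from accessory import LogQValues, LogReward, LinearControlSignal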
from sensedisposer import MotionTracer
from controller import EpsilonGreedy
from approximator import NeuralNetwork
from replaymemory import ReplayMemory

In [None]:
class Agent:
    
    def __init__(self, env_name, training, render=False, use_logging=True):
        
        self.env = gym.make(env_name)
        self.num_actions = self.env.action_space.n
        self.training = training
        self.render = render
        self.use_logging = use_logging
        
        # Log classes are used for storing the training logs.
        if self.use_logging and self.training:
            self.log_q_values = LogQValues()
            self.log_reward = LogReward()
        else:
            self.log_q_values = None
            self.log_reward = None
        
        self.action_names = self.env.unwrapped.get_action_meanings()
        
        # The probability of choosing a random action should decrease during training.
        self.epsilon_greedy = EpsilonGreedy(start_value=1.0,
                                            end_value=0.1,
                                            num_iterations=1e6,
                                            num_actions=self.num_actions,
                                            epsilon_testing=0.01)
        
        # Setting of hyper-parameters 
        if self.training:
            self.learning_rate_control = LinearControlSignal(start_value=1e-3,
                                                             end_value=1e-5,
                                                             num_iterations=5e6)
            self.loss_limit_control = LinearControlSignal(start_value=0.1,
                                                          end_value=0.015,
                                                          num_iterations=5e6)
            self.max_epochs_control = LinearControlSignal(start_value=5.0,
                                                          end_value=10.0,
                                                          num_iterations=5e6)
            self.replay_fraction = LinearControlSignal(start_value=0.1,
                                                       end_value=1.0,
                                                       num_iterations=5e6)
        else:
            self.learning_rate_control = None
            self.loss_limit_control = None
            self.max_epochs_control = None
            self.replay_fraction = None
        
        # Setting of replay memory
        if self.training:
            self.replay_memory = ReplayMemory(size=200000, num_actions=self.num_actions)
        else:
            self.replay_memory = None
        
        # Create the Neural Network used for estimating Q-values.
        # TODO: implementation of the approximator.
        self.model = NeuralNetwork(num_actions=self.num_actions,
                                   replay_memory=self.replay_memory)
        # Log of the rewards
        self.episode_rewards = []
        
    def reset_episode_reward(self):
        """Reset the log of episode-rewards."""
        self.episode_rewards = []
    
    def get_action_name(self, action):
        """Return the name of an action."""
        return self.action_names[action]
    
    def get_lives(self):
        """Get the number of lives the agent has in the game-environment."""
        return self.env.unwrapped.ale.lives()
    
    def run(self, num_episodes=None):
        """
        Run the game-environment and use the Neural Network to decide
        which actions to take in each step through Q-value estimates.
        
        :param num_episodes: 
            Number of episodes to process in the game-environment.
            If None then continue forever. This is useful during training
            where you might want to stop the training using Ctrl-C instead.
        """
        end_episode = True
        count_states = self.model.get_count_states()
        count_episodes = self.model.get_count_episodes()
        if num_episodes is None:
            num_episodes = float('inf')
        else:
            num_episodes += count_episodes
        
        while count_episodes <= num_episodes:
            if end_episode:
                img = self.env.reset()
                motion_tracer = MotionTracer(img)
                reward_episode = 0.0
                count_episodes = self.model.increase_count_episodes()
                num_lives = self.get_lives()
            state = motion_tracer.get_state()
            q_values = self.model.get_q_values(states=[state])[0]
            action, epsilon = self.epsilon_greedy.get_action(q_values=q_values,
                                                             iteration=count_states,
                                                             training=self.training)
            img, reward, end_episode, info = self.env.step(action=action)
            motion_tracer.process(image=img)
            reward_episode += reward
            num_lives_new = self.get_lives()
            end_life = (num_lives_new < num_lives)
            num_lives = num_lives_new
            count_states = self.model.increase_count_states()
            
            if not self.training and self.render:
                # Slow down rendering so the agent's behavior can be watched during testing.
                self.env.render()
                time.sleep(0.01)
            if self.training:
                self.replay_memory.add(state=state,
                                       q_values=q_values,
                                       action=action,
                                       reward=reward,
                                       end_life=end_life,
                                       end_episode=end_episode)
                use_fraction = self.replay_fraction.get_value(iteration=count_states)
                if self.replay_memory.is_full() or self.replay_memory.used_fraction() > use_fraction:
                    self.replay_memory.update_all_q_values()
                    if self.use_logging:
                        self.log_q_values.write(count_episodes=count_episodes,
                                                count_states=count_states,
                                                q_values=self.replay_memory.q_values)
                    learning_rate = self.learning_rate_control.get_value(iteration=count_states)
                    loss_limit = self.loss_limit_control.get_value(iteration=count_states)
                    max_epochs = self.max_epochs_control.get_value(iteration=count_states)
                    self.model.optimize(learning_rate=learning_rate,
                                        loss_limit=loss_limit,
                                        max_epochs=max_epochs)
                    self.model.save_checkpoint(count_states)
                    self.replay_memory.reset()
                
            if end_episode:
                self.episode_rewards.append(reward_episode)
            
            if len(self.episode_rewards) == 0:
                reward_mean = 0.0
            else:
                reward_mean = np.mean(self.episode_rewards[-30:])
            
            if self.training and end_episode:
                # Log reward to file.
                if self.use_logging:
                    self.log_reward.write(count_episodes=count_episodes,
                                          count_states=count_states,
                                          reward_episode=reward_episode,
                                          reward_mean=reward_mean)
                # Print reward to screen.
                msg = "{0:4}:{1}\t Epsilon: {2:4.2f}\t Reward: {3:.1f}\t Episode Mean: {4:.1f}"
                print(msg.format(count_episodes, count_states, epsilon,
                                 reward_episode, reward_mean))
            elif not self.training and (reward != 0.0 or end_life or end_episode):
                msg = "{0:4}:{1}\tQ-min: {2:5.3f}\tQ-max: {3:5.3f}\tLives: {4}\tReward: {5:.1f}\tEpisode Mean: {6:.1f}"
                print(msg.format(count_episodes, count_states, np.min(q_values),
                                 np.max(q_values), num_lives, reward_episode, reward_mean))
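
The agent above delegates action selection to `EpsilonGreedy`, whose implementation lives in `/agents/modules/` and is not shown in this notebook. As a rough, self-contained sketch of the idea (not the module's actual code), the following uses two hypothetical helpers, `linear_epsilon` and `select_action`, with the same start value, end value, and number of iterations that are passed to `EpsilonGreedy` above.

```python
import numpy as np

def linear_epsilon(iteration, start_value=1.0, end_value=0.1, num_iterations=1e6):
    """Linearly anneal epsilon from start_value to end_value, then hold it."""
    fraction = min(iteration / num_iterations, 1.0)
    return start_value + fraction * (end_value - start_value)

def select_action(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])
# With epsilon = 0 the choice is always the action with the largest Q-value.
greedy_action = select_action(q_values, epsilon=0.0, rng=rng)
```

Early in training epsilon is close to 1, so the agent mostly explores; as `iteration` grows it relies more and more on the Q-value estimates.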

## Test

In [None]:
import argparse

# Get the arguments.
env_name = 'Breakout-v0'
training = True
render = False
num_episodes = 100
checkpoint_base_dir = "./experiment/checkpoint"

# TODO solve the match of path.

agent = Agent(env_name=env_name,
              training=training,
              render=render)
agent.run(num_episodes=num_episodes)

rewards = agent.episode_rewards
print()  # Newline.
print("Rewards for {0} episodes:".format(len(rewards)))
print("- Min:   ", np.min(rewards))
print("- Mean:  ", np.mean(rewards))
print("- Max:   ", np.max(rewards))
print("- Stdev: ", np.std(rewards))