# M2177.003100 Deep Learning <br>Assignment #5 Part 1: Implementing and Training a Deep Q-Network

Copyright (C) Data Science Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by Hyungyu Lee, November 2019

In this notebook, you will implement one of famous reinforcement learning algorithm, Deep Q-Network (DQN) of DeepMind. <br>
The goal here is to understand a basic form of DQN [1, 2] and learn how to use OpenAI Gym toolkit [3].<br>
You need to follow the instructions to implement the given classes.

1. [Play](#play) ( 50 points )

**Note**: certain details are missing or ambiguous on purpose, in order to test your knowledge on the related materials. However, if you really feel that something essential is missing and cannot proceed to the next step, then contact the teaching staff with clear description of your problem.

### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.  
Once you have done **two parts of the assignment**, run the *CollectSubmission.sh* script with your **Team number** as input argument. <br>
This will produce a zipped file called *[Your team number].tar.gz*. Please submit this file on ETL. &nbsp;&nbsp; (Usage: ./*CollectSubmission.sh* &nbsp; Team_#)

### Some helpful references for assignment #4 :
- [1] Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013). [[pdf]](https://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&uact=8&ved=0ahUKEwiI3aqPjavVAhXBkJQKHZsIDpgQFgg7MAI&url=https%3A%2F%2Fwww.cs.toronto.edu%2F~vmnih%2Fdocs%2Fdqn.pdf&usg=AFQjCNEd1AJoM72DeDpI_GBoPuv7NnVoFA)
- [2] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533. [[pdf]](https://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf)
- [3] OpenAI GYM website [[link]](https://gym.openai.com/envs) and [[git]](https://github.com/openai/gym)

## 0. OpenAI Gym

OpenAI Gym is a toolkit to support diverse environments for developing reinforcement learning algorithms. You can use the toolkit with Python as well as TensorFlow. Installation guide of OpenAI Gym is offered by [this link](https://github.com/openai/gym#installation) or just type the command "pip install gym" (as well as "pip install gym[atari]" for Part2). 

After you set up OpenAI Gym, you can use APIs of the toolkit by inserting <font color=red>import gym</font> into your code. In this assignment, you must build one of famous reinforcement learning algorithms whose agent can run on OpenAI Gym environments. Please check how to use APIs such as funcions interacting with environments in the followings.

In [1]:
import tensorflow as tf
import cv2 
import gym
import numpy as np
import os
import argparse
import sys
from collections import deque
import random

#os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
#os.environ["CUDA_VISIBLE_DEVICES"]="0"

In [2]:
# Make an environment instance of CartPole-v0.
env = gym.make('CartPole-v0')

# Before interacting with the environment and starting a new episode, you must reset the environment's state.
state = env.reset()

#rendering game screens, do not need for assignment evaluation
env.render() 

# You can check action space and state (observation) space.
num_actions = env.action_space.n
state_shape = env.observation_space.shape
print(num_actions)
print(state_shape)

# "step" function performs agent's actions given current state of the environment and returns several values.
# Input: action (numerical data)
#        - env.action_space.sample(): select a random action among possible actions.
# Output: next_state (numerical data, next state of the environment after performing given action)
#         reward (numerical data, reward of given action given current state)
#         terminal (boolean data, True means the agent is done in the environment)
next_state, reward, terminal, info = env.step(env.action_space.sample())

[2019-12-08 16:34:25,310] Making new env: CartPole-v0
  result = entry_point.load(False)


2
(4,)


## 1. Implement a DQN agent
## 1) Overview of implementation in the notebook

The assignment is based on a method named by Deep Q-Network (DQN) [1,2]. You could find the details of DQN in the papers. The followings show briefly architecture of DQN and its training computation flow.

- (Pink flow) Play an episode and save transition records of the episode into a replay memory.
- (Green flow) Train DQN so that a loss function in the figure is minimized. The loss function is computed using main Q-network and Target Q-network. Target Q-network needs to be periodically updated by copying the main Q-network.
- (Purple flow) Gradient can be autonomously computed by tensorflow engine, if you build a proper optimizer.

![](image/architecture.png)

There are major 4 components, each of which needs to be implemented in this notebook. The Agent class must have an instance(s) of each class (Environment, DQN, ReplayMemory).
- Environment
- DQN 
- ReplayMemory
- Agent

![](image/components.png)



## 2) Design classes

In the code cells, there are only names of functions which are used in TA's implementation and their brief explanations. <font color='green'>...</font> means that the functions need more arguments and <font color='green'>pass</font> means that you need to write more codes. The functions may be helpful when you do not know how to start the assignment. Of course, you could change the functions such as deleting/adding functions or extending/reducing roles of the classes, <font color='red'> just keeping the existence of the classes</font>.

### Environment class

In [3]:
class Environment(object):
    def __init__(self, args):
        self.env = env
    
    def random_action(self):
        # Return a random action.
        return self.env.action_space.sample()
        
    
    def render_worker(self):
        # If display in your option is true, do rendering. Otherwise, do not.
        # you do not need to render in this assignment
        self.env.render()
    
    def new_episode(self):
        # Start a new episode and return the first state of the new episode.
        return self.env.reset()
    
    def act(self, action):
        # Perform an action which is given by input argument and return the results of acting.
        return self.env.step(action)

### ReplayMemory class

In [4]:
class ReplayMemory(object):
    def __init__(self, args):
        self.capacity = args.memory_size
        self.batch_size = args.batch_size
        self.memory = deque()
    
    def add(self, current_state, action, reward, next_state, terminal):
        # Add current_state, action, reward, terminal to replay_memory (next_state which can be added by your choice). 
        self.memory.append((current_state, action, reward, next_state, terminal))
        if len(self.memory) > self.capacity:
            self.memory.popleft()
    
    def mini_batch(self):
        # Return a mini_batch from replay_memory according to your sampling method. (such as uniform-random sampling in DQN papers)
        return random.sample(self.memory, self.batch_size)
        

### DQN class

In [5]:
class DQN(object):
    def __init__(self, args, sess, name):
        self.sess = sess
        self.lr = args.learning_rate
        self.input_size = args.input_size[0]
        self.output_size = args.output_size
        self.build_network(name)
        self.loss, self.opt = self.build_optimizer(name)
    
    def build_network(self, name):
        # Make your a deep neural network
        with tf.variable_scope(name, reuse=tf.AUTO_REUSE):
            self.inputs = tf.placeholder(tf.float32,shape=(None, self.input_size))
            layer1 = tf.layers.dense(self.inputs, 10, activation=tf.nn.relu,kernel_initializer=tf.contrib.layers.xavier_initializer())
            layer2 = tf.layers.dense(layer1, 20, activation=tf.nn.relu, kernel_initializer=tf.contrib.layers.xavier_initializer())
            self.pred = tf.layers.dense(layer2, self.output_size)
        
    
    def build_optimizer(self,name):
        # Make your optimizer
        with tf.variable_scope(name, reuse=tf.AUTO_REUSE):
            self.outputs = tf.placeholder(tf.float32,[None,self.output_size])
            self.loss = tf.reduce_mean(tf.square(self.outputs-self.pred))
            self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss)
        return self.loss, self.opt
    
    def train_network(self,X,Y):
        # Train the prediction_Q network using a mini-batch sampled from the replay memory
        return self.sess.run([self.loss, self.opt], feed_dict={self.inputs:X, self.outputs:Y})
    
    def update_target_network(self):
        copy_op = []
        pred_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='pred')
        target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='target')
        for pred_var, target_var in zip(pred_vars, target_vars):
            copy_op.append(target_var.assign(pred_var.value()))
        self.sess.run(copy_op)
    
    def predict_Q(self, state):
        x = np.reshape(state, [-1,self.input_size])
        return self.sess.run(self.pred, feed_dict={self.inputs:x})

### Agent class

In [6]:
class Agent(object):
    def __init__(self, args, sess):
        self.max_episodes = args.max_episodes
        self.train_interval = args.train_interval
        self.sess = sess
        self.discount_factor = 0.99
        self.epsilon = 1.0 
        self.env = Environment(args)
        self.memory = ReplayMemory(args)
        self.input_size = args.input_size
        self.output_size = args.output_size
        self.dqn = DQN(args, self.sess, 'pred')
        self.target_dqn = DQN(args, self.sess, 'target')
        tf.global_variables_initializer().run()
        self.dqn.update_target_network()
        self.saver = tf.train.Saver()
    
    def select_action(self,state):
        # Select an action according ε-greedy. You need to use a random-number generating function and add a library if necessary.
        if np.random.rand(1) <= self.epsilon:
            return self.env.random_action()
        else:
            return np.argmax(self.dqn.predict_Q(state))
    
    def train(self):
        # Train your agent 
        # Several hyper-parameters are determined by your choice
        # Keep epsilon-greedy action selection in your mind 
        for episode in range(self.max_episodes):
            self.epsilon = 1./((episode/3) + 1)
            terminal = False
            count = 0
            state = self.env.new_episode()
            
            while not terminal:
                action = self.select_action(state)
                next_state, reward, terminal, _ = self.env.act(action)
                if terminal:
                    reward = -100
                self.memory.add(state, action, reward, next_state, terminal)
                state = next_state
                count += 1
                if count > 10000:
                    break
            
            print("Episode: {} count: {}".format(episode, count))
            
            if count > 10000:
                pass
            
            if episode % self.train_interval ==1:
                for _ in range(5):
                    batch = self.memory.mini_batch()
                    X = np.empty(0).reshape(0,self.input_size[0])
                    Y = np.empty(0).reshape(0,self.output_size)
                    for state, action, reward, next_state, terminal in batch:
                        Q = self.dqn.predict_Q(state)
                        if terminal:
                            Q[0,action] = reward
                        else:
                            Q[0,action] = reward + self.discount_factor*np.max(self.dqn.predict_Q(next_state))
                        Y = np.vstack([Y,Q])
                        X = np.vstack([X,state])
                    loss, _ = self.dqn.train_network(X,Y)
                print("Loss: ", loss)
                self.dqn.update_target_network()
    
    def play(self):
        # Test your agent 
        # When performing test, you can show the environment's screen by rendering if you want
        state = self.env.new_episode()
        total_reward = 0
        while True:
            self.env.render_worker()
            action= np.argmax(self.dqn.predict_Q(state))
            state, reward, terminal, _ = self.env.act(action)
            total_reward += reward
            if terminal:
                print("Total reward: {}".format(total_reward))
                return total_reward
    
    def save(self):
        checkpoint_dir = 'cartpole'
        if not os.path.exists(checkpoint_dir):
            os.mkdir(checkpoint_dir)
        self.saver.save(self.sess, os.path.join(checkpoint_dir, 'trained_agent'))
        
    def load(self):
        checkpoint_dir = 'cartpole'
        self.saver.restore(self.sess, os.path.join(checkpoint_dir, 'trained_agent'))

## 2. Train your agent 

Now, you train an agent to play CartPole-v0. Options class is the collection of hyper-parameters that you can choice. Usage of Options class is not mandatory.<br>
The maximum value of total reward which can be aquired from one episode is 200. 
<font color='red'>**You should show learning status such as the number of observed states and mean/max/min of rewards frequently (for instance, every 100 states).**</font>

In [9]:
parser = argparse.ArgumentParser(description="CartPole")
parser.add_argument('--env-name', default='CartPole-v0', type=str,
                    help="Environment")
"""
You can add more arguments.
for example, visualize, memory_size, batch_size, discount_factor, eps_max, eps_min, learning_rate, train_interval, copy_interval and so on
"""
parser.add_argument('--input_size', type=int, default=state_shape)
parser.add_argument('--output_size', type=int, default=num_actions)
parser.add_argument('--memory_size', type=int, default=5000)
parser.add_argument('--batch_size', type=int, default=10)
parser.add_argument('--discount_factor', type=float, default=0.9)
parser.add_argument('--learning_rate', type=float, default=1e-1)
parser.add_argument('--train_interval', type=int, default=10)
parser.add_argument('--max_episodes', type=int, default=1200)
sys.argv = ['-f']

config = tf.ConfigProto()
# If you use a GPU, uncomment
#os.environ["CUDA_VISIBLE_DEVICES"] = '0'
config.log_device_placement = False
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    args = parser.parse_args()
    myAgent = Agent(args, sess) # It depends on your class implementation
    myAgent.train()
    myAgent.save()

Episode: 0 count: 15
Episode: 1 count: 20
Loss:  2.426657
Episode: 2 count: 27
Episode: 3 count: 27
Episode: 4 count: 30
Episode: 5 count: 18
Episode: 6 count: 30
Episode: 7 count: 30
Episode: 8 count: 20
Episode: 9 count: 24
Episode: 10 count: 27
Episode: 11 count: 24
Loss:  516.10046
Episode: 12 count: 12
Episode: 13 count: 9
Episode: 14 count: 14
Episode: 15 count: 10
Episode: 16 count: 11
Episode: 17 count: 10
Episode: 18 count: 9
Episode: 19 count: 11
Episode: 20 count: 10
Episode: 21 count: 10
Loss:  1.4268222
Episode: 22 count: 8
Episode: 23 count: 11
Episode: 24 count: 10
Episode: 25 count: 10
Episode: 26 count: 8
Episode: 27 count: 10
Episode: 28 count: 10
Episode: 29 count: 9
Episode: 30 count: 11
Episode: 31 count: 12
Loss:  521.22845
Episode: 32 count: 9
Episode: 33 count: 12
Episode: 34 count: 10
Episode: 35 count: 10
Episode: 36 count: 9
Episode: 37 count: 9
Episode: 38 count: 10
Episode: 39 count: 9
Episode: 40 count: 10
Episode: 41 count: 10
Loss:  494.77936
Episode: 42

Episode: 352 count: 10
Episode: 353 count: 10
Episode: 354 count: 9
Episode: 355 count: 9
Episode: 356 count: 9
Episode: 357 count: 9
Episode: 358 count: 10
Episode: 359 count: 11
Episode: 360 count: 10
Episode: 361 count: 10
Loss:  221.09761
Episode: 362 count: 9
Episode: 363 count: 8
Episode: 364 count: 8
Episode: 365 count: 9
Episode: 366 count: 10
Episode: 367 count: 10
Episode: 368 count: 8
Episode: 369 count: 10
Episode: 370 count: 9
Episode: 371 count: 10
Loss:  115.46681
Episode: 372 count: 9
Episode: 373 count: 9
Episode: 374 count: 10
Episode: 375 count: 10
Episode: 376 count: 8
Episode: 377 count: 10
Episode: 378 count: 10
Episode: 379 count: 10
Episode: 380 count: 10
Episode: 381 count: 10
Loss:  15.208049
Episode: 382 count: 9
Episode: 383 count: 11
Episode: 384 count: 10
Episode: 385 count: 8
Episode: 386 count: 9
Episode: 387 count: 10
Episode: 388 count: 10
Episode: 389 count: 10
Episode: 390 count: 8
Episode: 391 count: 10
Loss:  51.90572
Episode: 392 count: 10
Episode

Episode: 692 count: 38
Episode: 693 count: 94
Episode: 694 count: 55
Episode: 695 count: 41
Episode: 696 count: 29
Episode: 697 count: 149
Episode: 698 count: 32
Episode: 699 count: 27
Episode: 700 count: 47
Episode: 701 count: 37
Loss:  25.224682
Episode: 702 count: 24
Episode: 703 count: 33
Episode: 704 count: 16
Episode: 705 count: 18
Episode: 706 count: 35
Episode: 707 count: 27
Episode: 708 count: 30
Episode: 709 count: 38
Episode: 710 count: 20
Episode: 711 count: 78
Loss:  6.680272
Episode: 712 count: 74
Episode: 713 count: 72
Episode: 714 count: 78
Episode: 715 count: 46
Episode: 716 count: 38
Episode: 717 count: 36
Episode: 718 count: 54
Episode: 719 count: 68
Episode: 720 count: 78
Episode: 721 count: 55
Loss:  3.3950806
Episode: 722 count: 28
Episode: 723 count: 45
Episode: 724 count: 31
Episode: 725 count: 33
Episode: 726 count: 34
Episode: 727 count: 40
Episode: 728 count: 30
Episode: 729 count: 33
Episode: 730 count: 31
Episode: 731 count: 51
Loss:  16.51898
Episode: 732 

Episode: 1022 count: 89
Episode: 1023 count: 71
Episode: 1024 count: 89
Episode: 1025 count: 62
Episode: 1026 count: 86
Episode: 1027 count: 67
Episode: 1028 count: 81
Episode: 1029 count: 200
Episode: 1030 count: 82
Episode: 1031 count: 200
Loss:  9.236241
Episode: 1032 count: 83
Episode: 1033 count: 67
Episode: 1034 count: 65
Episode: 1035 count: 92
Episode: 1036 count: 68
Episode: 1037 count: 58
Episode: 1038 count: 200
Episode: 1039 count: 77
Episode: 1040 count: 72
Episode: 1041 count: 78
Loss:  408.12946
Episode: 1042 count: 118
Episode: 1043 count: 119
Episode: 1044 count: 147
Episode: 1045 count: 109
Episode: 1046 count: 134
Episode: 1047 count: 192
Episode: 1048 count: 111
Episode: 1049 count: 114
Episode: 1050 count: 119
Episode: 1051 count: 163
Loss:  5.383857
Episode: 1052 count: 200
Episode: 1053 count: 200
Episode: 1054 count: 178
Episode: 1055 count: 200
Episode: 1056 count: 146
Episode: 1057 count: 194
Episode: 1058 count: 167
Episode: 1059 count: 200
Episode: 1060 coun

## <a name="play"></a> 3. Test the trained agent ( 50 points )

Now, we test your agent and calculate an average reward of 20 episodes.
- 0 <= average reward < 50 : you can get 0 points
- 50 <= average reward < 100 : you can get 10 points
- 100 <= average reward < 190 : you can get 35 points
- 190 <= average reward <= 200 : you can get 50 points

In [10]:
config = tf.ConfigProto()
# If you use a GPU, uncomment
#os.environ["CUDA_VISIBLE_DEVICES"] = '0'
config.log_device_placement = False
config.gpu_options.allow_growth = True
tf.reset_default_graph()
with tf.Session(config=config) as sess:
    args = parser.parse_args() # You set the option of test phase
    myAgent = Agent(args, sess) # It depends on your class implementation
    myAgent.load()
    rewards = []
    for i in range(20):
        r = myAgent.play() # play() returns the reward cumulated in one episode
        rewards.append(r)
    mean = np.mean(rewards)
    print(rewards)
    print(mean)

INFO:tensorflow:Restoring parameters from cartpole/trained_agent


[2019-12-08 16:38:19,152] Restoring parameters from cartpole/trained_agent


Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
Total reward: 200.0
[200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0]
200.0
