# CartPole: A Comparison
The goal of this mini-project is to apply deep reinforcement learning techniques to the CartPole environment. We will implement DQN (REF) along with two improvements, Double DQN and Dueling DQN, and compare the performance of all three on CartPole. The aim is a small test bed for understanding what each improvement contributes.

### OpenAI Gym
First, I import OpenAI Gym and check that the CartPole environment works. I play one game with a simple hand-written policy: push the cart left whenever the pole's angular velocity is negative (the pole is rotating to the left), and push it right otherwise.

In [1]:
import tensorflow as tf
import numpy as np
import gym
import matplotlib.pyplot as plt
import random
%matplotlib inline

In [2]:
env = gym.make('CartPole-v0')
state = env.reset()
done = False
while not done:
    env.render()
    # This policy looks at the pole's angular velocity (state[3]) and pushes the cart in the
    # direction the pole is rotating (action 0 = push left, action 1 = push right).
    # It is by no means a perfect policy, but it is a decent test that the environment works as expected.
    action = 0 if (state[3] < 0) else 1
    state, reward, done, info = env.step(action)
    
env.render(close=True)

### Deep Q Learning (DQN)
The agent below implements DQN as presented in [REF]. Since CartPole is a simple game, the network has only two hidden layers. The original DQN paper introduced two key ideas: 1) a replay buffer that records (S, A, R, S') experiences, and 2) a Q-learning target (the approximation of the true Q-value) that is held fixed and only periodically updated.

To implement 2), we create two identical networks, plus copy operations that move the parameters from one network to the other.
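
To make idea 2) concrete, here is a minimal NumPy sketch of the target computation, y = r + gamma * (1 - done) * max Q_target(s', a'). The function name `td_targets` and the sample batch are illustrative assumptions, not part of the agent below. Note also that, as its own comments state, the agent defined later computes its targets from the main network rather than the target network.

```python
import numpy as np

def td_targets(rewards, dones, next_q_target, gamma=0.99):
    """Q-learning targets: y = r + gamma * (1 - done) * max_a' Q_target(s', a').

    next_q_target holds the frozen network's Q-values for each next state,
    with shape (batch, num_actions).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    next_q_target = np.asarray(next_q_target, dtype=np.float64)
    # Terminal transitions (done == 1) get no bootstrapped future value.
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)

# Hypothetical batch: two transitions, two actions each; the second one is terminal.
targets = td_targets(
    rewards=[1.0, 1.0],
    dones=[0, 1],
    next_q_target=[[0.5, 2.0], [3.0, 4.0]],
    gamma=0.5,
)
print(targets)
```

Because the targets are plain numbers by the time they reach the loss, no gradient flows through them, which is what "held fixed" means in practice.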

In [3]:
# Replay Buffer Class by David Kroezen

from collections import deque

class ReplayBuffer:

    num_state = 4  # CartPole states have 4 components; only used by the commented-out reshapes below
    
    def __init__(self, buffer_size):
        " Initializes the replay buffer by creating a deque() and setting the size and buffer count. "
        self.buffer = deque()
        self.buffer_size = buffer_size
        self.count = 0
         
    def add(self, s, a, r, d, s2):
         
        """ Adds new experience to the ReplayBuffer(). If the buffer size is
        reached, the oldest item is removed.
         
        Inputs needed to create new experience:
            s      - State
            a      - Action
            r      - Reward
            d      - Done
            s2     - Resulting State     
        """
        d = 1 if d else 0
        # Create experience list
        experience = (s, a, r, d, s2)
        
        # Check the size of the buffer
        if self.count < self.buffer_size:
            self.count += 1
        else:
            self.buffer.popleft()
            
        # Add experience to buffer
        self.buffer.append(experience)
        
    def size(self):
        " Return the amount of stored experiences. " 
        return self.count
    
    def batch(self, batch_size):
        "Return a \"batch_size\" number of random samples from the buffer."
        
        if self.count < batch_size:
            batch = random.sample(self.buffer, self.count)
            batch_size = self.count
        else:
            batch = random.sample(self.buffer, batch_size)
            
        batch_state = np.array([item[0] for item in batch])#.reshape([batch_size,self.num_state])
        batch_action = np.array([item[1] for item in batch])#.reshape([batch_size, 1])
        batch_reward = np.array([item[2] for item in batch])#.reshape([batch_size, 1])
        batch_done = np.array([item[3] for item in batch])#.reshape([batch_size, 1])
        batch_next_state = np.array([item[4] for item in batch])#.reshape([batch_size,self.num_state])
        
        return batch_state, batch_action, batch_reward, batch_done, batch_next_state 
            
    def clear(self):
        " Remove all entries from the ReplayBuffer. "
        self.buffer.clear()
        self.count = 0
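
As a side note, the manual size bookkeeping in `add` can also be delegated to the deque itself. The sketch below is an alternative under that assumption, not the class used in this notebook: it relies on the standard `maxlen` argument of `collections.deque` and fills the buffer with placeholder transitions.

```python
import random
from collections import deque

# With maxlen set, the deque discards its oldest item automatically when full.
buffer = deque(maxlen=3)
for step in range(5):
    # Placeholder (state, action, reward, done, next_state) tuples.
    buffer.append((step, 0, 1.0, 0, step + 1))

oldest_state = buffer[0][0]
# Sample without replacement, never asking for more items than are stored.
batch = random.sample(list(buffer), min(2, len(buffer)))
print(len(buffer), oldest_state, len(batch))
```

Only the three most recent transitions survive, which mirrors the `popleft()` branch in the class above.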

In [23]:
# def makeDQN(state, h_size, a_size, name):
#     # Make a network
#     with tf.variable_scope(name):
#         h1 = tf.nn.relu(tf.layers.dense(state, h_size))
#         h2 = tf.nn.relu(tf.layers.dense(h1, h_size))
#         out = tf.layers.dense(h2, a_size)
#     return out

def makeDQN(state, h_size, a_size, name):
    # Make a network
    with tf.variable_scope(name):
        h1 = tf.contrib.layers.fully_connected(state, h_size)
        h2 = tf.contrib.layers.fully_connected(h1, h_size)
        out = tf.contrib.layers.fully_connected(h2, a_size, activation_fn=None)
    return out

def copyVars(fromName, toName):
    # Constructs the copy operations
    fvars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=fromName)
    tvars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=toName)
    
    copy = [tf.assign(t, f) for f, t in zip(fvars, tvars)]
    return copy
    

class DQN():
    def __init__(self, sess, e_size, h_size, a_size, lr=0.01, gamma=0.99, replay_length=10000, batch_size=64):
        
        # Store params
        self.batch_size = batch_size
        self.replay_length = replay_length
        self.sess = sess
        
        # Define Inputs
        self.state = tf.placeholder(tf.float32, shape=[None, e_size], name='State_input')
        self.actions = tf.placeholder(tf.int32, shape=[None], name='actions')
        self.rewards = tf.placeholder(tf.float32, shape=[None], name='rewards')
        self.dones = tf.placeholder(tf.float32, shape=[None], name='dones')
        self.next_state = tf.placeholder(tf.float32, shape=[None, e_size], name='Next_State_input')
        self.targetQ = tf.placeholder(tf.float32, shape=[None], name='QTargets')
        self.gamma = tf.placeholder(tf.float32, name='Gamma')
#         self.learning_rate = tf.placeholder(tf.float32)
        
        # Define Network
        self.main = makeDQN(self.state, h_size, a_size, 'main')
        self.target = makeDQN(self.next_state, h_size, a_size, 'target')
        
        # Create the update operations for updating the target network
        self.copy = copyVars('main', 'target')
        
        # Define the Loss operation 
        # In line with the Udacity implementation, the target-generation operation is split off
        # from the rest of the computation graph. The targets are now also computed from self.main
        # instead of self.target.
        self.makeTargets = self.rewards + self.gamma*(tf.ones_like(self.dones)-self.dones)*tf.reduce_max(self.main, axis=1)
        print (self.makeTargets)
        choice = tf.one_hot(self.actions, a_size)
        print (self.main)
        print (choice)
        self.predictedQ = tf.reduce_sum(self.main*choice, axis=1)
        print (self.predictedQ)
        
        self.loss = tf.reduce_mean(tf.square(self.targetQ-self.predictedQ))
        
        # Optimizer
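        # Only the main network feeds the loss, so only its variables receive gradients
        # (the explicit var_list restriction is left disabled below)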
        tvars = tf.trainable_variables(scope='main')
        self.optimize = tf.train.AdamOptimizer(lr).minimize(self.loss)#, var_list=tvars)
        
        # Helper ops
        self.choice = tf.argmax(self.main, 1)
        self.choice_prob = tf.nn.softmax(self.main)
        
        
        # Set up experience replay buffer
        self.replay = ReplayBuffer(replay_length)
        
    def train(self, gamma=0.99):
        # Get a batch of experiences
        states, actions, rewards, dones, next_states = self.replay.batch(self.batch_size)
        targets = self.sess.run(self.makeTargets, feed_dict={self.state: next_states,
                                                             self.rewards: rewards,
                                                             self.dones: dones,
                                                             self.gamma: gamma})
        predicted = self.sess.run(self.predictedQ, feed_dict={self.state: states,
                                                self.actions: actions})
#         print ("=========================")
#         print (targets)
#         print (predicted)
        Qs = self.sess.run(self.main, feed_dict={self.state: states})
#         print (Qs)
        self.sess.run(self.optimize, feed_dict={self.state: states,
                                                self.actions: actions,
                                                self.targetQ: targets})
    def clean_state(self, s):
        # Accept lists as well as arrays, and add a batch dimension to a single state
        s = np.asarray(s)
        if s.ndim == 1:
            s = s[None, :]
        return s
    
    def remember(self, state, action, reward, done, next_state):
        self.replay.add(state, action, reward, done, next_state)
        
    def choose_action(self, state):
        Qs = self.sess.run(self.main, feed_dict={self.state: self.clean_state(state)})
        action = self.sess.run(self.choice, feed_dict={self.state: self.clean_state(state)})
#         print ("=====================")
#         print (Qs)
#         print (action)
        return action[0]
        
    
    def choose_probs(self, state):
        return self.sess.run(self.choice_prob, feed_dict={self.state: self.clean_state(state)})
    
    def update_target(self):
#         print (self.sess.run(self.main, feed_dict={self.state:[[0.01,0.01,0.01,0.01]]}))
#         print (self.sess.run(self.target, feed_dict={self.next_state:[[0.01,0.01,0.01,0.01]]}))
        self.sess.run(self.copy)
#         print (self.sess.run(self.target, feed_dict={self.next_state:[[0.01,0.01,0.01,0.01]]}))
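
The loss in the class above picks out the Q-value of the action actually taken by multiplying with a one-hot mask. A small NumPy sketch of that selection, using made-up Q-values and targets, may make the shapes easier to follow:

```python
import numpy as np

# Hypothetical Q-values for a batch of two states with two actions each.
q_values = np.array([[1.0, 3.0],
                     [2.0, 0.5]])
actions = np.array([1, 0])        # action taken in each state
targets = np.array([2.0, 2.0])    # made-up target values

# One row of the identity matrix per action gives the one-hot mask.
one_hot = np.eye(q_values.shape[1])[actions]
# Zero out the Q-values of actions not taken, then sum over the action axis.
predicted_q = (q_values * one_hot).sum(axis=1)
# Mean squared error between targets and the selected Q-values.
loss = np.mean((targets - predicted_q) ** 2)
print(predicted_q, loss)
```

The mask keeps the operation differentiable with respect to the selected Q-value only, so the Q-values of the other actions are untouched by the update.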
        
        

### Running DQN
With the agent defined, it's time to test it.
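
The run below anneals the exploration probability exponentially in the total number of environment steps. As a rough guide to how slowly that happens, this sketch evaluates the same formula with the same three exploration values; the helper name `epsilon_at` is an assumption made for illustration.

```python
import numpy as np

def epsilon_at(step, start=1.0, stop=0.01, decay=0.0001):
    """Exploration probability after `step` environment steps."""
    return stop + (start - stop) * np.exp(-decay * step)

# Print the schedule at a few step counts to see how quickly exploration fades.
for step in (0, 1000, 10000, 50000):
    print(step, round(float(epsilon_at(step)), 3))
```

With a decay rate of 0.0001, epsilon is still above 0.3 after 10,000 steps, so the agent keeps acting largely at random for many episodes.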

In [25]:
tf.reset_default_graph()

# Hyperparameters
num_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
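gamma = 0                      # future reward discount used until gamma_wait steps have passed
gamma_wait = 1000              # step count at which gamma switches to gamma_later
gamma_later = 0.99             # future reward discount used after the switch
train_freq = 1                 # train once every train_freq steps
target_update_freq = None      # steps between target-network copies (None disables them)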

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_layer = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
buffer_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size

# Reporting Interval
report_freq = 1


env = gym.make('CartPole-v0')
# Environment Parameters
e_size = 4
a_size = 2


def test_player(player, env, runs=100):
    reward_sum = 0
    for i in range(runs):
        run_reward = 0
        state = env.reset()
        done = False
        t = 0.
        while not done:
            t += 1
            action = player.choose_action(state)
            state, reward, done, info = env.step(action)
            run_reward += reward
            if done:
                reward_sum += run_reward
                break
        env.reset()
    return reward_sum/runs



step_count = 0
scores = []
sess = tf.Session()

# Create Agent
player = DQN(sess, e_size, hidden_layer, a_size, learning_rate, gamma, buffer_size, batch_size)

# Initialize 
sess.run(tf.global_variables_initializer())
player.update_target()

for i in range(num_episodes):
    state = env.reset()
    done = False
    t = 0
    while not done:
        step_count += 1
        t += 1
        # Epsilon greedy exploration policy
        epsilon = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step_count) 
        if (np.random.uniform() <= epsilon):
            # Make a random action
            action = env.action_space.sample()
        else:
            action = player.choose_action(state)
            # Do a weighted sample to pick your action
#             action_prob = player.choose_probs(state)[0]
# #                 print (action_prob)
#             action = np.random.choice(range(a_size), p=action_prob)

        new_state, reward, done, info = env.step(action)
#         reward = 0 if done else 1
        
        player.remember(state, action, reward, done, new_state)
        state = new_state
        
        if (step_count == gamma_wait):
            print("Activate!")
            gamma = gamma_later

        if (step_count % train_freq == 0):
            player.train(gamma)

        if (target_update_freq is not None and step_count % target_update_freq == 0):
            print ("updating target")
            player.update_target()

        if done:

            if (i % report_freq == 0 or i == num_episodes-1):
                score = test_player(player, env, runs=32)
                scores.append(score)
                print("Round: {0:7d} \tEpsilon: {1:5.2f} \tTest Score: {2:7.2f}".format(i, epsilon, score))

plt.plot(scores)
plt.show()

Tensor("add:0", dtype=float32)
Tensor("main/fully_connected_2/BiasAdd:0", shape=(?, 2), dtype=float32)
Tensor("one_hot:0", shape=(?, 2), dtype=float32)
Tensor("Sum:0", shape=(?,), dtype=float32)
Round:       0 	Epsilon:  1.00 	Test Score:   11.66
Round:       1 	Epsilon:  1.00 	Test Score:   13.94
Round:       2 	Epsilon:  1.00 	Test Score:   19.31
Round:       3 	Epsilon:  0.99 	Test Score:   26.69
Round:       4 	Epsilon:  0.99 	Test Score:   24.53
Round:       5 	Epsilon:  0.99 	Test Score:   23.19
Round:       6 	Epsilon:  0.99 	Test Score:   21.16
Round:       7 	Epsilon:  0.99 	Test Score:   13.00
Round:       8 	Epsilon:  0.98 	Test Score:   18.25
Round:       9 	Epsilon:  0.98 	Test Score:   17.62
Round:      10 	Epsilon:  0.98 	Test Score:   16.44
Round:      11 	Epsilon:  0.98 	Test Score:   14.59
Round:      12 	Epsilon:  0.98 	Test Score:   18.12
Round:      13 	Epsilon:  0.97 	Test Score:   16.25
Round:      14 	Epsilon:  0.97 	Test Score:   19.31
...

Round:     155 	Epsilon:  0.72 	Test Score:    9.25
Round:     156 	Epsilon:  0.72 	Test Score:    9.12
Round:     157 	Epsilon:  0.72 	Test Score:    9.41
Round:     158 	Epsilon:  0.72 	Test Score:    9.19
Round:     159 	Epsilon:  0.72 	Test Score:    9.25
Round:     160 	Epsilon:  0.72 	Test Score:    9.34
Round:     161 	Epsilon:  0.72 	Test Score:    9.31
Round:     162 	Epsilon:  0.72 	Test Score:    9.38
Round:     163 	Epsilon:  0.71 	Test Score:    9.28
Round:     164 	Epsilon:  0.71 	Test Score:    9.38
Round:     165 	Epsilon:  0.71 	Test Score:    9.16
Round:     166 	Epsilon:  0.71 	Test Score:    9.44
Round:     167 	Epsilon:  0.71 	Test Score:    9.34
Round:     168 	Epsilon:  0.71 	Test Score:    9.41
Round:     169 	Epsilon:  0.71 	Test Score:    9.34
Round:     170 	Epsilon:  0.70 	Test Score:    9.03
Round:     171 	Epsilon:  0.70 	Test Score:    9.50
Round:     172 	Epsilon:  0.70 	Test Score:    9.31
Round:     173 	Epsilon:  0.70 	Test Score:    9.69
...

Round:     313 	Epsilon:  0.37 	Test Score:  116.81
Round:     314 	Epsilon:  0.36 	Test Score:  170.44
Round:     315 	Epsilon:  0.36 	Test Score:  167.34
Round:     316 	Epsilon:  0.36 	Test Score:  167.78
Round:     317 	Epsilon:  0.36 	Test Score:  101.03
Round:     318 	Epsilon:  0.35 	Test Score:  115.00
Round:     319 	Epsilon:  0.35 	Test Score:  124.62
Round:     320 	Epsilon:  0.35 	Test Score:  149.12
Round:     321 	Epsilon:  0.35 	Test Score:  168.16
Round:     322 	Epsilon:  0.34 	Test Score:   71.84
Round:     323 	Epsilon:  0.34 	Test Score:  169.25
Round:     324 	Epsilon:  0.34 	Test Score:  136.03
Round:     325 	Epsilon:  0.34 	Test Score:   95.25
Round:     326 	Epsilon:  0.33 	Test Score:  100.25
Round:     327 	Epsilon:  0.33 	Test Score:   87.28
Round:     328 	Epsilon:  0.33 	Test Score:  149.72
Round:     329 	Epsilon:  0.32 	Test Score:  143.12
Round:     330 	Epsilon:  0.32 	Test Score:  114.16
Round:     331 	Epsilon:  0.32 	Test Score:  128.59
...

KeyboardInterrupt: 