# DDPG

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import tensorflow as tf
print('TensorFlow Version: {}'.format(tf.__version__))
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

##### >**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [2]:
import gym
env = gym.make('BipedalWalker-v2')

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m


  result = entry_point.load(False)


We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [3]:
env.observation_space, env.action_space

(Box(24,), Box(4,))

In [4]:
# state = env.reset()
# batch = []
# for _ in range(1111):
#     #env.render()
#     action = env.action_space.sample()
#     next_state, reward, done, _ = env.step(action) # take a random action
#     batch.append([state, action, next_state, reward, float(done)])
#     state = next_state
#     if done:
#         state = env.reset()

To shut the window showing the simulation, use `env.close()`.

In [5]:
# env.close()

If you ran the simulation above, we can look at the rewards:

In [6]:
# import numpy as np
# actions = np.array([each[0] for each in batch])
# states = np.array([each[1] for each in batch])
# rewards = np.array([each[2] for each in batch])
# dones = np.array([each[3] for each in batch])
# infos = np.array([each[4] for each in batch])

In [7]:
# print('states:', np.max(np.array(states)), np.min(np.array(states)))
# print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
# print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

In [8]:
# env.action_space.high, env.action_space.low

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [9]:
env.observation_space.dtype, env.action_space.dtype

(dtype('float32'), dtype('float32'))

In [10]:
env.observation_space, env.action_space

(Box(24,), Box(4,))

In [11]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float64, [None, state_size], name='states')
    actions = tf.placeholder(tf.float64, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float64, [None], name='targetQs')
    isTraining = tf.placeholder(tf.bool, [], name='isTraining')
    return states, actions, targetQs, isTraining

In [12]:
env.action_space.low, env.action_space.high

(array([-1., -1., -1., -1.], dtype=float32),
 array([1., 1., 1., 1.], dtype=float32))

In [13]:
def actor(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('actor', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        return logits

In [14]:
def actor_target(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('actor_target', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        return logits

In [15]:
env.observation_space.low, env.observation_space.high

(array([-inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
        -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
        -inf, -inf], dtype=float32),
 array([inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
        inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
       dtype=float32))

In [16]:
def critic(states, actions, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('critic', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=action_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        nl1_fused = tf.concat(axis=1, values=[nl1, actions])
        h2 = tf.layers.dense(inputs=nl1_fused, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)
        return logits

In [17]:
def critic_target(states, actions, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('critic_target', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=action_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        nl1_fused = tf.concat(axis=1, values=[nl1, actions])
        h2 = tf.layers.dense(inputs=nl1_fused, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)
        return logits

In [26]:
def model_loss(actions, states, targetQs, action_size, hidden_size, isTraining):
    #######################################################################
    actions_target = actor_target(states=states, hidden_size=hidden_size, action_size=action_size, 
                                  training=isTraining)
    gQlogits_target = critic_target(states=states, actions=actions_target, hidden_size=hidden_size, 
                                    action_size=action_size, training=isTraining)
    ###################################################################
    actions_logits = actor(states=states, hidden_size=hidden_size, action_size=action_size, 
                           training=isTraining)
    gQlogits = critic(states=states, actions=actions_logits, hidden_size=hidden_size, action_size=action_size,
                      training=isTraining)
    ###########################################################################
    Qlogits = critic(states=states, actions=actions, hidden_size=hidden_size, action_size=action_size, 
                     training=isTraining, reuse=True)
    ###########################################################################
    Qs = tf.reshape(Qlogits, shape=[-1])
    gQs = tf.reshape(gQlogits, shape=[-1])
    dloss = tf.reduce_mean(tf.square(Qs - targetQs))
    gloss = -tf.reduce_mean(gQs)
    ################################################################
    gQs_target = tf.reshape(gQlogits_target, shape=[-1])
    return actions_target, gQs_target, gloss, dloss

In [27]:
def model_opt(gloss, dloss, g_learning_rate, d_learning_rate, gamma):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('actor')]
    d_vars = [var for var in t_vars if var.name.startswith('critic')]
    g_vars_tgt = [var for var in t_vars if var.name.startswith('actor_target')]
    d_vars_tgt = [var for var in t_vars if var.name.startswith('critic_target')]

    #     # # Optimize RNN
    #     #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    #     g_grads = tf.gradients(gloss, g_vars)
    #     d_grads = tf.gradients(dloss, d_vars)
    #     g_opt = tf.train.AdamOptimizer(g_learning_rate).apply_gradients(grads_and_vars=zip(g_grads, g_vars))
    #     d_opt = tf.train.AdamOptimizer(d_learning_rate).apply_gradients(grads_and_vars=zip(d_grads, d_vars))
    
    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(g_learning_rate).minimize(gloss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(d_learning_rate).minimize(dloss, var_list=d_vars)
    
    # Soft update/ NOT hard update
    # θ_target = τ*θ + (1 - τ)*θ_target
    g_opt_tgt = [g_vars_tgt[i].assign((g_vars[i]*gamma) + (g_vars_tgt[i]*(1 - gamma)))
                 for i in range(len(g_vars_tgt))]
    d_opt_tgt = [d_vars_tgt[i].assign((d_vars[i]*gamma) + (d_vars_tgt[i]*(1 - gamma))) 
                 for i in range(len(d_vars_tgt))]

    return g_opt, d_opt, g_opt_tgt, d_opt_tgt

In [28]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, g_learning_rate, d_learning_rate, gamma):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, self.isTraining = model_input(state_size=state_size, 
                                                                                action_size=action_size)

        # Create the Model: calculating the loss and forwad pass
        self.actions_tgt, self.gQ_target, self.gloss, self.dloss = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init
            states=self.states, actions=self.actions, targetQs=self.targetQs, isTraining=self.isTraining)
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt, self.g_opt_tgt, self.d_opt_tgt = model_opt(gloss=self.gloss, dloss=self.dloss,
                                                                           g_learning_rate=g_learning_rate, 
                                                                           d_learning_rate=d_learning_rate, 
                                                                           gamma=gamma)

In [29]:
import numpy as np
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, replace=False)
        return [self.buffer[ii] for ii in idx]

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [30]:
env.observation_space, env.action_space, \
env.observation_space.low, env.observation_space.high, \
env.action_space.low, env.action_space.high

(Box(24,),
 Box(4,),
 array([-inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
        -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
        -inf, -inf], dtype=float32),
 array([inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
        inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
       dtype=float32),
 array([-1., -1., -1., -1.], dtype=float32),
 array([1., 1., 1., 1.], dtype=float32))

In [31]:
# Network parameters
state_size = 24
action_size = 4
hidden_size = 24*2             # number of units in each Q-network hidden layer
g_learning_rate = 1e-4         # Q-network learning rate
d_learning_rate = 1e-4         # Q-network learning rate

# Memory parameters
memory_size = int(1e6)            # memory capacity
batch_size = int(1e3)             # experience mini-batch size == one episode size is 1000/int(1e3) steps
gamma = 0.99                   # future reward discount

In [32]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size,
              g_learning_rate=g_learning_rate, d_learning_rate=d_learning_rate, gamma=gamma)

# Init the memory
memory = Memory(max_size=memory_size)

In [33]:
env.observation_space.high, env.observation_space.low, env.observation_space, \
env.action_space.high, env.action_space.low, env.action_space, \
env.reward_range

(array([inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
        inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
       dtype=float32),
 array([-inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
        -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
        -inf, -inf], dtype=float32),
 Box(24,),
 array([1., 1., 1., 1.], dtype=float32),
 array([-1., -1., -1., -1.], dtype=float32),
 Box(4,),
 (-inf, inf))

In [34]:
# state = env.reset()

# for each_step in range(memory_size):
#     #env.render()

#     action = env.action_space.sample() # randomness
#     action = np.clip(action, -1, 1) # clipped: [-1, +1]
#     next_state, reward, done, _ = env.step(action)
#     reward_in = 0
#     memory.buffer.append([state, action, next_state, reward, float(done), reward_in])
#     state = next_state
    
#     if done is True:
#         print('Progress:', each_step/memory_size)
#         state = env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [35]:
len(memory.buffer), memory.buffer.maxlen

(0, 1000000)

In [38]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list = [], []
gloss_list, dloss_list = [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    episode_reward = deque(maxlen=100) # 100 episodes for running average/running mean/window
    n_episodes=1
    max_t=1000000
    goal_score=30

    # Training episodes/epochs
    for ep in range(n_episodes):
        gloss_batch, dloss_batch = [], []
        
        state = env.reset()
        #print(states.shape)
        total_reward = 0                          # initialize the score (for each agent)
        
        # Training steps/batches
        for t in range(max_t):
            action_tgt = sess.run(model.actions_tgt, feed_dict={model.states:state.reshape([1, -1]),
                                                                 model.isTraining: False})
            noise = np.random.normal(loc=0, scale=0.1, size=action_size)
            action_tgt += noise
            
            action = np.clip(action_tgt, -1, 1) # [-1, +1]
            #print(actions.shape)
            
            next_state, reward, done, _ = env.step(action)
            
            memory.buffer.append([state, action, next_state, reward, done])

            total_reward += reward                         # update the score (for each agent)
            state = next_state                               # roll over states to next time step
            
            # Training
            if len(memory.buffer) >= batch_size:
                batch = memory.sample(batch_size)
                states = np.array([each[0] for each in batch])
                actions = np.array([each[1] for each in batch])
                next_states = np.array([each[2] for each in batch])
                rewards = np.array([each[3] for each in batch])
                dones = np.array([each[4] for each in batch])

                gQ_target = sess.run(model.gQ_target, feed_dict = {model.states: next_states, 
                                                                     model.isTraining: False})
                #nextQs = nextQlogits.reshape(-1)
                targetQs = rewards + (gamma * gQ_target * (1-dones))

                feed_dict = {model.states: states, model.actions: actions, model.targetQs: targetQs,
                             model.isTraining: True}
                dloss, _= sess.run([model.dloss, model.d_opt], feed_dict)
                gloss, _= sess.run([model.gloss, model.g_opt], feed_dict)
                gloss_batch.append(gloss)
                dloss_batch.append(dloss)
                
                # Update target networks
                sess.run(model.d_opt_tgt)
                sess.run(model.g_opt_tgt)
            
            # End of episode
            if done is True:
                break
                
        # Print out
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)))
        
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        
        # Break episode/epoch loop
        ## Option 1: Solve the First Version
        #The task is episodic, and in order to solve the environment, 
        #your agent must get an average score of +30 over 100 consecutive episodes.
        if np.mean(episode_reward) >= goal_score:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

TypeError: only size-1 arrays can be converted to Python scalars

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(gloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(dloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

## Testing

Let's checkout how our trained agent plays the game.

In [31]:
import gym
env = gym.make('BipedalWalker-v2')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model2.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Episodes/epochs
    for _ in range(1):
        state = env.reset()
        total_reward = 0

        # Steps/batches
        while True:
            env.render()
            actions_pred = sess.run(model.actions_pred, feed_dict={model.states: state.reshape([1, -1])})
            action = action_pred.reshape([-1])
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                print('total_reward: {}'.format(total_reward))
                break
                
env.close()

INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt
total_reward: -104.69566602519279


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.