# DAQL for Acrobat

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import tensorflow as tf
print('TensorFlow Version: {}'.format(tf.__version__))
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.12.0
Default GPU Device: /device:GPU:0


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

##### >**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [2]:
import gym
env = gym.make('Pendulum-v0')

We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [3]:
env.observation_space, env.action_space

(Box(3,), Box(1,))

In [4]:
# state = env.reset()
# batch = []
# for _ in range(1111):
#     #env.render()
#     action = env.action_space.sample()
#     next_state, reward, done, _ = env.step(action) # take a random action
#     batch.append([state, action, next_state, reward, float(done)])
#     state = next_state
#     if done:
#         state = env.reset()

To shut the window showing the simulation, use `env.close()`.

In [5]:
# env.close()

If you ran the simulation above, we can look at the rewards:

In [6]:
env.action_space.high, env.action_space.low

(array([2.], dtype=float32), array([-2.], dtype=float32))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [7]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_shape], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.float32, [None, action_size], name='actions')
    next_states = tf.placeholder(tf.float32, [None, state_size], name='next_state')
    rewards = tf.placeholder(tf.float32, [None], name='rewards')
    dones = tf.placeholder(tf.float32, [None], name='dones')
    return states, actions, next_states, rewards, dones

In [8]:
# a = act(s)
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False, trainable=True):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size, trainable=trainable)
        bn1 = tf.layers.batch_normalization(h1, training=training, trainable=trainable)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size, trainable=trainable)
        bn2 = tf.layers.batch_normalization(h2, training=training, trainable=trainable)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size, trainable=trainable)        
        pred = tf.tanh(logits)
        return pred

In [9]:
# Q = env(s, a)
def discriminator(states, actions, state_size, action_size, hidden_size, reuse=False, alpha=0.1, training=False, 
                  trainable=True):
    with tf.variable_scope('discriminator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=action_size, trainable=trainable)
        bn1 = tf.layers.batch_normalization(h1, training=training, trainable=trainable)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        fused = tf.concat(axis=1, values=[nl1, actions])
        h2 = tf.layers.dense(inputs=fused, units=hidden_size, trainable=trainable)
        bn2 = tf.layers.batch_normalization(h2, training=training, trainable=trainable)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        #next_states_logits = tf.layers.dense(inputs=nl2, units=state_size, trainable=trainable)        
        logits = tf.layers.dense(inputs=nl2, units=1, trainable=trainable)        
        return logits

In [11]:
def model_loss(action_size, hidden_size, state_size, gamma,
               states, actions, next_states, rewards, dones):
    #########################################################
    actions_pred = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    #########################################
    Qlogits = discriminator(actions=actions, states=states, hidden_size=hidden_size, 
                            action_size=action_size, state_size=state_size)
    Qs = tf.reshape(Qlogits, shape=[-1])
    print('Qs.shape, Qlogits.shape:', Qs.shape, Qlogits.shape)
    ########################################
    next_actions_pred = generator(states=next_states, hidden_size=hidden_size, action_size=action_size, 
                                  reuse=True)
    nextQlogits = discriminator(actions=next_actions_pred, states=next_states, hidden_size=hidden_size,
                                action_size=action_size, state_size=state_size, reuse=True)
    nextQs = tf.reshape(nextQlogits, shape=[-1])
    targetQs = rewards + (gamma * nextQs * (1-dones))
    print('nextQlogits.shape, nextQs.shape, targetQs.shape:', nextQlogits.shape, nextQs.shape, targetQs.shape)
    #loss = tf.reduce_mean(tf.square(Qs - targetQs))
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_pred, loss

In [12]:
# Optimizating/training/learning G & D
def model_opt(loss, g_learning_rate, d_learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(g_learning_rate).minimize(-loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(d_learning_rate).minimize(loss, var_list=d_vars)
    return g_opt, d_opt

In [13]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, g_learning_rate, d_learning_rate, gamma):
        
        # Model inputs
        self.states, self.actions, self.next_states, self.rewards, self.dones = model_input(
            state_size=state_size, action_size=action_size)

        # Model loss/objective
        self.actions_pred, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, state_size=state_size, gamma=gamma,
            states=self.states, actions=self.actions, next_states=self.next_states,
            rewards=self.rewards, dones=self.dones)
        
        # Model optimization/update
        self.g_opt, self.d_opt = model_opt(loss=self.loss,
                                           g_learning_rate=g_learning_rate, 
                                           d_learning_rate=d_learning_rate)

In [14]:
import numpy as np
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, replace=False)
        return [self.buffer[ii] for ii in idx]

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [15]:
env.observation_space, env.action_space

(Box(3,), Box(1,))

In [16]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01           # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 3
action_size = 1
hidden_size = 24*2             # number of units in each Q-network hidden layer
g_learning_rate = 1e-4         # Q-network learning rate
d_learning_rate = 1e-4         # Q-network learning rate

# Memory parameters
memory_size = int(1e5)            # memory capacity
batch_size = int(1e2)             # experience mini-batch size == one episode size is 1000/int(1e3) steps
gamma = 0.99                   # future reward discount

In [17]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, gamma=gamma,
              g_learning_rate=g_learning_rate, d_learning_rate=d_learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

Qs.shape, Qlogits.shape: (?,) (?, 1)
nextQlogits.shape, nextQs.shape, targetQs.shape: (?, 1) (?,) (?,)


In [18]:
env.observation_space.high, env.observation_space.low, env.observation_space, \
env.action_space.high, env.action_space.low, env.action_space, \
env.reward_range

(array([1., 1., 8.], dtype=float32),
 array([-1., -1., -8.], dtype=float32),
 Box(3,),
 array([2.], dtype=float32),
 array([-2.], dtype=float32),
 Box(1,),
 (-inf, inf))

In [19]:
state = env.reset()

for each_step in range(memory_size):
    #env.render()

    action = env.action_space.sample() # randomness
    action = np.clip(action, -2, 2) # clipped: [-2, +2]
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    
    if done is True:
        print('Progress:', each_step/memory_size)
        state = env.reset()

Progress: 0.00199
Progress: 0.00399
Progress: 0.00599
Progress: 0.00799
Progress: 0.00999
Progress: 0.01199
Progress: 0.01399
Progress: 0.01599
Progress: 0.01799
Progress: 0.01999
Progress: 0.02199
Progress: 0.02399
Progress: 0.02599
Progress: 0.02799
Progress: 0.02999
Progress: 0.03199
Progress: 0.03399
Progress: 0.03599
Progress: 0.03799
Progress: 0.03999
Progress: 0.04199
Progress: 0.04399
Progress: 0.04599
Progress: 0.04799
Progress: 0.04999
Progress: 0.05199
Progress: 0.05399
Progress: 0.05599
Progress: 0.05799
Progress: 0.05999
Progress: 0.06199
Progress: 0.06399
Progress: 0.06599
Progress: 0.06799
Progress: 0.06999
Progress: 0.07199
Progress: 0.07399
Progress: 0.07599
Progress: 0.07799
Progress: 0.07999
Progress: 0.08199
Progress: 0.08399
Progress: 0.08599
Progress: 0.08799
Progress: 0.08999
Progress: 0.09199
Progress: 0.09399
Progress: 0.09599
Progress: 0.09799
Progress: 0.09999
Progress: 0.10199
Progress: 0.10399
Progress: 0.10599
Progress: 0.10799
Progress: 0.10999
Progress: 

Progress: 0.96399
Progress: 0.96599
Progress: 0.96799
Progress: 0.96999
Progress: 0.97199
Progress: 0.97399
Progress: 0.97599
Progress: 0.97799
Progress: 0.97999
Progress: 0.98199
Progress: 0.98399
Progress: 0.98599
Progress: 0.98799
Progress: 0.98999
Progress: 0.99199
Progress: 0.99399
Progress: 0.99599
Progress: 0.99799
Progress: 0.99999


## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [20]:
len(memory.buffer), memory.buffer.maxlen

(100000, 100000)

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, gloss_list, dloss_list = [], [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    episode_reward = deque(maxlen=100)
    
    # Training episodes/epochs
    for ep in range(1111):
        gloss_batch, dloss_batch = [], []
        total_reward = 0
        state = env.reset()

        # Training steps/batches
        while True:
            env.render()
            
            action_pred = sess.run(model.actions_pred, feed_dict={model.states: state.reshape([1, -1])})
            noise = np.random.normal(loc=0, scale=0.1, size=action_size) # randomness
            action = action_pred.reshape([-1]) + noise
            #print(action.shape, action_logits.shape, noise.shape)
            action = np.clip(action, -2, 2) # clipped: [-2, +2]
            
            next_state, reward, done, _ = env.step(action)
            # reward_in_pred = sess.run(model.rewards_in_pred, feed_dict={model.states: state.reshape([1, -1]), 
            #                                                              model.actions: action.reshape([1, -1]), 
            #                                                              model.next_states: next_state.reshape([1, -1])})

            # reward_in = reward_in_pred[0]
            memory.buffer.append([state, action, next_state, reward, float(done)])
            total_reward += reward
            state = next_state
            #print('reward, reward_in:', reward, reward_in)
            
            # Training
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            feed_dict = {model.states: states, 
                         model.actions: actions,
                         model.next_states: next_states, 
                         model.rewards: rewards, 
                         model.dones: dones}
            loss, _, _ = sess.run([model.loss, model.d_opt, model.g_opt], feed_dict)
            gloss_batch.append(-loss)
            dloss_batch.append(loss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        # Break episode/epoch loop
        # Did not solve the environment. 
        # Best 100-episode average reward was 220.62 ± 0.69. 
        #  when the agent obtains an average reward of at least 300 over 100 consecutive episodes.)        
        if np.mean(episode_reward) >= 300:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:-1558.3876 R:-1558.3876 gloss:-49.8854 dloss:49.8854
Episode:1 meanR:-1632.8479 R:-1707.3081 gloss:-50.1747 dloss:50.1747
Episode:2 meanR:-1519.8231 R:-1293.7735 gloss:-50.1909 dloss:50.1909
Episode:3 meanR:-1535.0705 R:-1580.8129 gloss:-48.3300 dloss:48.3300
Episode:4 meanR:-1464.3175 R:-1181.3053 gloss:-51.1917 dloss:51.1917
Episode:5 meanR:-1493.8941 R:-1641.7769 gloss:-47.2204 dloss:47.2204
Episode:6 meanR:-1472.8032 R:-1346.2581 gloss:-42.5222 dloss:42.5222
Episode:7 meanR:-1450.7133 R:-1296.0842 gloss:-41.3973 dloss:41.3973
Episode:8 meanR:-1487.5449 R:-1782.1978 gloss:-58.6790 dloss:58.6790
Episode:9 meanR:-1466.7998 R:-1280.0934 gloss:-54.1786 dloss:54.1786
Episode:10 meanR:-1439.7472 R:-1169.2217 gloss:-48.9611 dloss:48.9611
Episode:11 meanR:-1435.9749 R:-1394.4797 gloss:-44.0393 dloss:44.0393
Episode:12 meanR:-1433.5880 R:-1404.9442 gloss:-40.1840 dloss:40.1840
Episode:13 meanR:-1425.7067 R:-1323.2499 gloss:-37.1215 dloss:37.1215
Episode:14 meanR:-1416.3577 R:

Episode:118 meanR:-1294.6625 R:-1151.9733 gloss:-35.4804 dloss:35.4804
Episode:119 meanR:-1292.3909 R:-1053.2319 gloss:-34.5654 dloss:34.5654
Episode:120 meanR:-1291.3490 R:-1332.1465 gloss:-34.6424 dloss:34.6424
Episode:121 meanR:-1291.1471 R:-1349.9150 gloss:-34.7134 dloss:34.7134
Episode:122 meanR:-1284.2550 R:-1059.4173 gloss:-34.5392 dloss:34.5392
Episode:123 meanR:-1285.1899 R:-1443.6753 gloss:-34.9497 dloss:34.9497
Episode:124 meanR:-1286.3049 R:-1360.4665 gloss:-35.1007 dloss:35.1007
Episode:125 meanR:-1286.6023 R:-1306.9252 gloss:-34.8189 dloss:34.8189
Episode:126 meanR:-1288.3378 R:-1033.7160 gloss:-34.3138 dloss:34.3138
Episode:127 meanR:-1289.6808 R:-1336.3527 gloss:-34.9979 dloss:34.9979
Episode:128 meanR:-1286.0321 R:-1173.7377 gloss:-34.7754 dloss:34.7754
Episode:129 meanR:-1288.0332 R:-1232.9721 gloss:-34.4988 dloss:34.4988
Episode:130 meanR:-1289.1510 R:-1170.6898 gloss:-34.2318 dloss:34.2318
Episode:131 meanR:-1292.0815 R:-1575.0262 gloss:-35.0103 dloss:35.0103
Episod

Episode:234 meanR:-1311.4653 R:-1305.0745 gloss:-35.6213 dloss:35.6213
Episode:235 meanR:-1310.7570 R:-1373.2202 gloss:-36.0473 dloss:36.0473
Episode:236 meanR:-1310.6022 R:-1324.6558 gloss:-35.7912 dloss:35.7912
Episode:237 meanR:-1307.5410 R:-1030.5988 gloss:-35.7260 dloss:35.7260
Episode:238 meanR:-1310.1128 R:-1341.7192 gloss:-35.9011 dloss:35.9011
Episode:239 meanR:-1309.9603 R:-1351.8502 gloss:-35.8958 dloss:35.8958
Episode:240 meanR:-1307.1054 R:-1055.1413 gloss:-35.7818 dloss:35.7818
Episode:241 meanR:-1307.4247 R:-1362.3918 gloss:-36.0994 dloss:36.0994
Episode:242 meanR:-1309.8865 R:-1359.0273 gloss:-35.9712 dloss:35.9712
Episode:243 meanR:-1309.5904 R:-1217.8014 gloss:-36.0090 dloss:36.0090
Episode:244 meanR:-1306.7054 R:-1081.2658 gloss:-36.1018 dloss:36.1018
Episode:245 meanR:-1306.8867 R:-1359.0082 gloss:-36.3253 dloss:36.3253
Episode:246 meanR:-1306.8671 R:-1329.2562 gloss:-36.0866 dloss:36.0866
Episode:247 meanR:-1305.3665 R:-1219.7358 gloss:-36.4698 dloss:36.4698
Episod

Episode:350 meanR:-1319.5676 R:-1355.4385 gloss:-39.0389 dloss:39.0389
Episode:351 meanR:-1321.1687 R:-1319.9971 gloss:-39.6011 dloss:39.6011
Episode:352 meanR:-1317.7713 R:-1033.1478 gloss:-38.8182 dloss:38.8182
Episode:353 meanR:-1317.7254 R:-1345.9282 gloss:-38.9711 dloss:38.9711
Episode:354 meanR:-1317.6596 R:-1365.2032 gloss:-39.0272 dloss:39.0272
Episode:355 meanR:-1317.2532 R:-1327.4365 gloss:-39.5257 dloss:39.5257
Episode:356 meanR:-1316.6691 R:-1332.1894 gloss:-39.4970 dloss:39.4970
Episode:357 meanR:-1320.0005 R:-1337.2902 gloss:-39.2539 dloss:39.2539
Episode:358 meanR:-1320.0441 R:-1346.9842 gloss:-38.8866 dloss:38.8866
Episode:359 meanR:-1322.3076 R:-1337.7218 gloss:-39.7110 dloss:39.7110
Episode:360 meanR:-1324.1460 R:-1330.6913 gloss:-39.4224 dloss:39.4224
Episode:361 meanR:-1323.8298 R:-1330.9008 gloss:-39.5844 dloss:39.5844
Episode:362 meanR:-1323.3418 R:-1331.9439 gloss:-39.1707 dloss:39.1707
Episode:363 meanR:-1320.7631 R:-1431.0451 gloss:-39.7222 dloss:39.7222
Episod

Episode:466 meanR:-1288.4058 R:-1335.5222 gloss:-41.6954 dloss:41.6954
Episode:467 meanR:-1293.5252 R:-1416.0068 gloss:-41.1922 dloss:41.1922
Episode:468 meanR:-1293.5108 R:-1329.7279 gloss:-41.6470 dloss:41.6470
Episode:469 meanR:-1294.2949 R:-1399.4206 gloss:-41.4835 dloss:41.4835
Episode:470 meanR:-1294.0012 R:-1358.2918 gloss:-41.1784 dloss:41.1784
Episode:471 meanR:-1292.4046 R:-889.7927 gloss:-41.6900 dloss:41.6900
Episode:472 meanR:-1292.3007 R:-1337.8226 gloss:-41.0915 dloss:41.0915
Episode:473 meanR:-1291.2651 R:-1239.4622 gloss:-41.4897 dloss:41.4897
Episode:474 meanR:-1290.0301 R:-1061.0355 gloss:-41.4888 dloss:41.4888
Episode:475 meanR:-1289.8185 R:-1354.4326 gloss:-41.8967 dloss:41.8967
Episode:476 meanR:-1291.9776 R:-1378.7278 gloss:-41.6475 dloss:41.6475
Episode:477 meanR:-1294.7999 R:-1307.1992 gloss:-41.5080 dloss:41.5080
Episode:478 meanR:-1294.1358 R:-1350.9068 gloss:-41.8903 dloss:41.8903
Episode:479 meanR:-1293.9983 R:-1348.0552 gloss:-41.8623 dloss:41.8623
Episode

Episode:582 meanR:-1282.1325 R:-1036.6318 gloss:-40.4430 dloss:40.4430
Episode:583 meanR:-1282.4655 R:-1362.4744 gloss:-39.9216 dloss:39.9216
Episode:584 meanR:-1281.7733 R:-1128.1721 gloss:-40.2860 dloss:40.2860
Episode:585 meanR:-1283.1721 R:-1295.1481 gloss:-40.6032 dloss:40.6032
Episode:586 meanR:-1283.1606 R:-1328.1092 gloss:-40.5472 dloss:40.5472
Episode:587 meanR:-1283.5253 R:-1364.8653 gloss:-39.9564 dloss:39.9564
Episode:588 meanR:-1285.1404 R:-1080.0612 gloss:-39.8919 dloss:39.8919
Episode:589 meanR:-1285.1835 R:-1373.0243 gloss:-39.9810 dloss:39.9810
Episode:590 meanR:-1284.8326 R:-1313.4350 gloss:-40.5480 dloss:40.5480
Episode:591 meanR:-1284.9131 R:-1342.3438 gloss:-40.4874 dloss:40.4874
Episode:592 meanR:-1279.9446 R:-1349.7608 gloss:-40.7594 dloss:40.7594
Episode:593 meanR:-1280.0098 R:-1348.7793 gloss:-39.4965 dloss:39.4965
Episode:594 meanR:-1280.0462 R:-1347.2971 gloss:-39.8543 dloss:39.8543
Episode:595 meanR:-1279.9171 R:-1326.3759 gloss:-40.0373 dloss:40.0373
Episod

Episode:698 meanR:-1353.6135 R:-1163.0504 gloss:-40.5851 dloss:40.5851
Episode:699 meanR:-1352.6707 R:-1164.7117 gloss:-40.8243 dloss:40.8243
Episode:700 meanR:-1348.8293 R:-961.0187 gloss:-40.6435 dloss:40.6435
Episode:701 meanR:-1346.9944 R:-1180.5594 gloss:-41.1760 dloss:41.1760
Episode:702 meanR:-1344.4100 R:-1026.7826 gloss:-41.5758 dloss:41.5758
Episode:703 meanR:-1341.5136 R:-1046.1161 gloss:-40.6525 dloss:40.6525
Episode:704 meanR:-1340.2469 R:-1215.0512 gloss:-41.1168 dloss:41.1168
Episode:705 meanR:-1338.4816 R:-1116.7678 gloss:-41.0991 dloss:41.0991
Episode:706 meanR:-1334.4032 R:-1324.2070 gloss:-40.4546 dloss:40.4546
Episode:707 meanR:-1337.6167 R:-1527.0219 gloss:-40.9311 dloss:40.9311
Episode:708 meanR:-1339.2272 R:-1320.9493 gloss:-41.2652 dloss:41.2652
Episode:709 meanR:-1340.1348 R:-1163.4960 gloss:-40.8656 dloss:40.8656
Episode:710 meanR:-1343.0700 R:-1604.8869 gloss:-40.3496 dloss:40.3496
Episode:711 meanR:-1341.6974 R:-1226.7461 gloss:-40.2039 dloss:40.2039
Episode

Episode:814 meanR:-1445.2478 R:-881.5521 gloss:-42.6583 dloss:42.6583
Episode:815 meanR:-1442.1207 R:-1182.1046 gloss:-42.1058 dloss:42.1058
Episode:816 meanR:-1440.6370 R:-1201.5078 gloss:-42.2982 dloss:42.2982
Episode:817 meanR:-1441.0716 R:-1336.2514 gloss:-41.5415 dloss:41.5415
Episode:818 meanR:-1444.1157 R:-1640.6342 gloss:-42.5736 dloss:42.5736
Episode:819 meanR:-1443.8636 R:-1272.4225 gloss:-41.9557 dloss:41.9557
Episode:820 meanR:-1438.9508 R:-1309.3622 gloss:-41.9277 dloss:41.9277
Episode:821 meanR:-1434.1722 R:-1173.6380 gloss:-41.9470 dloss:41.9470
Episode:822 meanR:-1435.0017 R:-1781.4231 gloss:-42.6147 dloss:42.6147
Episode:823 meanR:-1433.9249 R:-1194.7735 gloss:-42.3424 dloss:42.3424
Episode:824 meanR:-1439.7157 R:-1706.3903 gloss:-41.6721 dloss:41.6721
Episode:825 meanR:-1438.9263 R:-1241.6880 gloss:-42.9913 dloss:42.9913
Episode:826 meanR:-1443.6480 R:-1750.8926 gloss:-41.9499 dloss:41.9499
Episode:827 meanR:-1439.2297 R:-1308.4749 gloss:-42.1729 dloss:42.1729
Episode

Episode:930 meanR:-1379.1040 R:-1197.1473 gloss:-43.3916 dloss:43.3916
Episode:931 meanR:-1373.3770 R:-1230.8504 gloss:-43.2055 dloss:43.2055
Episode:932 meanR:-1370.1751 R:-1029.9460 gloss:-43.1879 dloss:43.1879
Episode:933 meanR:-1365.5508 R:-1343.2307 gloss:-43.1584 dloss:43.1584
Episode:934 meanR:-1366.5651 R:-1786.8827 gloss:-43.5320 dloss:43.5320
Episode:935 meanR:-1372.2473 R:-1769.6213 gloss:-43.1859 dloss:43.1859
Episode:936 meanR:-1377.4588 R:-1785.0385 gloss:-42.7629 dloss:42.7629
Episode:937 meanR:-1378.2570 R:-1247.2260 gloss:-43.7365 dloss:43.7365
Episode:938 meanR:-1373.3868 R:-1326.4686 gloss:-43.4384 dloss:43.4384
Episode:939 meanR:-1378.4614 R:-1831.9033 gloss:-43.5243 dloss:43.5243
Episode:940 meanR:-1376.5759 R:-1102.6794 gloss:-43.4310 dloss:43.4310
Episode:941 meanR:-1369.7908 R:-1101.2433 gloss:-43.2023 dloss:43.2023
Episode:942 meanR:-1377.0411 R:-1843.5953 gloss:-43.0465 dloss:43.0465
Episode:943 meanR:-1370.0921 R:-991.0173 gloss:-42.9952 dloss:42.9952
Episode

Episode:1045 meanR:-1306.5295 R:-1049.8873 gloss:-44.3206 dloss:44.3206
Episode:1046 meanR:-1300.8155 R:-1190.1759 gloss:-43.9507 dloss:43.9507
Episode:1047 meanR:-1300.8519 R:-1168.2304 gloss:-43.7695 dloss:43.7695
Episode:1048 meanR:-1292.1142 R:-989.1681 gloss:-44.4037 dloss:44.4037
Episode:1049 meanR:-1285.0391 R:-1155.7541 gloss:-43.7772 dloss:43.7772
Episode:1050 meanR:-1290.2056 R:-1769.0414 gloss:-43.7591 dloss:43.7591
Episode:1051 meanR:-1297.3955 R:-1856.7997 gloss:-43.9201 dloss:43.9201
Episode:1052 meanR:-1296.9364 R:-1170.7131 gloss:-43.7494 dloss:43.7494
Episode:1053 meanR:-1302.1292 R:-1810.0034 gloss:-43.9686 dloss:43.9686
Episode:1054 meanR:-1296.3705 R:-1173.6280 gloss:-43.5921 dloss:43.5921
Episode:1055 meanR:-1297.0721 R:-1147.1186 gloss:-43.9826 dloss:43.9826
Episode:1056 meanR:-1297.3710 R:-1827.0921 gloss:-43.9385 dloss:43.9385
Episode:1057 meanR:-1297.5331 R:-1186.7626 gloss:-43.8505 dloss:43.8505
Episode:1058 meanR:-1300.3762 R:-1338.5841 gloss:-43.5008 dloss:4

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(gloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(dloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

## Testing

Let's checkout how our trained agent plays the game.

In [31]:
import gym
env = gym.make('BipedalWalker-v2')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model2.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Episodes/epochs
    for _ in range(1):
        state = env.reset()
        total_reward = 0

        # Steps/batches
        while True:
            env.render()
            actions_pred = sess.run(model.actions_pred, feed_dict={model.states: state.reshape([1, -1])})
            action = action_pred.reshape([-1])
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                print('total_reward: {}'.format(total_reward))
                break
                
env.close()

INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt
total_reward: -104.69566602519279


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.