# Deep cortical reinforcement learning: Policy gradients + Q-learning + GAN


In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
# Check which TensorFlow version is installed and whether it can see a GPU
# (an empty device name below means TensorFlow will run on the CPU)
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included `gym` as a submodule, so you can run `git submodule update --init --recursive` to pull the contents into the `gym` directory.

>**Note:** Make sure you have OpenAI Gym cloned, then run `pip install -e gym/[all]`.

In [2]:
import gym

## Create the Cart-Pole game environment
# env = gym.make('CartPole-v0')
# env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('MountainCarContinuous-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

# Discrete/int or continuous/float
env.action_space.dtype

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




dtype('float32')

We interact with the simulation through `env`. To show the simulation running, call `env.render()` to render one frame. Passing an action to `env.step` advances the simulation by one step. You can inspect the available actions with `env.action_space` and draw a random action with `env.action_space.sample()`. This interface is common to all Gym environments. In the Cart-Pole game there are two possible actions, moving the cart left or right, encoded as the integers 0 and 1. The `BipedalWalker-v2` environment selected above instead takes a vector of four `float32` values as its action.

Run the code below to step through the simulation with random actions (uncomment `env.render()` to watch it).

In [3]:
env.observation_space

Box(24,)

In [4]:
state = env.reset()
batch = []
for _ in range(1111):
    #env.render()
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action) # take a random action
    batch.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done:
        state = env.reset()

To shut the window showing the simulation, use `env.close()`.

In [5]:
# env.close()

If you ran the simulation above, we can look at the rewards:

In [6]:
batch[0], 
batch[0][1].shape, state.shape

((4,), (24,))

In [7]:
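import numpy as np
# Each transition is [state, action, next_state, reward, done], so the names
# below are offset from the fields they actually read (see trailing comments).
actions = np.array([each[0] for each in batch])  # field 0: state, shape (N, 24)
states = np.array([each[1] for each in batch])   # field 1: action, shape (N, 4)
rewards = np.array([each[2] for each in batch])  # field 2: next_state, shape (N, 24)
dones = np.array([each[3] for each in batch])    # field 3: reward, shape (N,)
infos = np.array([each[4] for each in batch])    # field 4: done flag, shape (N,)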

In [8]:
# print(rewards[-20:])
print('shapes:', np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print('dtypes:', np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('states:', np.max(np.array(states)), np.min(np.array(states)))
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print('rewards:', np.max(np.array(rewards)), np.min(np.array(rewards)))

shapes: (1111, 24) (1111, 4) (1111, 24) (1111,)
dtypes: float64 float32 float64 float64
states: 0.99992806 -0.9998551
actions: 2.445514678955078 -2.266592502593994
rewards: 2.445514678955078 -2.266592502593994


In [9]:
actions[:10]

array([[ 2.74703396e-03,  3.15252488e-06, -4.11065966e-04,
        -1.60000205e-02,  9.22507718e-02,  9.54264484e-04,
         8.60050470e-01,  6.44485272e-04,  1.00000000e+00,
         3.26296426e-02,  9.54228162e-04,  8.53657931e-01,
        -6.49294195e-04,  1.00000000e+00,  4.40813839e-01,
         4.45819944e-01,  4.61422592e-01,  4.89549994e-01,
         5.34102559e-01,  6.02460802e-01,  7.09148586e-01,
         8.85931492e-01,  1.00000000e+00,  1.00000000e+00],
       [ 2.46125250e-03, -6.90751910e-03,  5.26649058e-03,
         1.96621442e-02, -2.90470362e-01, -7.09619224e-01,
         1.47205219e+00,  9.93585507e-01,  1.00000000e+00,
         3.00898015e-01, -1.58208907e-02,  1.63350701e-01,
         3.32505027e-01,  1.00000000e+00,  4.52898622e-01,
         4.58041966e-01,  4.74072367e-01,  5.02970874e-01,
         5.48744857e-01,  6.18977070e-01,  7.28589714e-01,
         9.10219014e-01,  1.00000000e+00,  1.00000000e+00],
       [ 2.56516994e-03,  2.57482082e-03,  7.42817447e

In [10]:
rewards[:10]

array([[ 2.46125250e-03, -6.90751910e-03,  5.26649058e-03,
         1.96621442e-02, -2.90470362e-01, -7.09619224e-01,
         1.47205219e+00,  9.93585507e-01,  1.00000000e+00,
         3.00898015e-01, -1.58208907e-02,  1.63350701e-01,
         3.32505027e-01,  1.00000000e+00,  4.52898622e-01,
         4.58041966e-01,  4.74072367e-01,  5.02970874e-01,
         5.48744857e-01,  6.18977070e-01,  7.28589714e-01,
         9.10219014e-01,  1.00000000e+00,  1.00000000e+00],
       [ 2.56516994e-03,  2.57482082e-03,  7.42817447e-03,
         6.84252322e-03, -1.35565288e-02, -3.95231038e-01,
         1.08347261e+00,  0.00000000e+00,  1.00000000e+00,
         2.60842085e-01, -4.16364878e-01,  2.69976616e-01,
         2.41971036e-01,  1.00000000e+00,  4.56012309e-01,
         4.61191028e-01,  4.77331609e-01,  5.06428778e-01,
         5.52517474e-01,  6.23232543e-01,  7.33598769e-01,
         9.16476786e-01,  1.00000000e+00,  1.00000000e+00],
       [-2.49468628e-02, -5.56278229e-02, -3.40317643e

In [11]:
# import numpy as np
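def sigmoid(x, derivative=False):
    # With derivative=True, x is expected to already be a sigmoid output s,
    # since the derivative is computed as s * (1 - s)
    return x*(1-x) if derivative else 1/(1+np.exp(-x))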
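
A quick check of the `derivative=True` convention, as a sketch in plain Python. The `sigmoid_scalar` helper is a hypothetical scalar stand-in for the NumPy version above, not part of the notebook. The point is that the derivative branch expects the sigmoid's *output*, not the raw input.

```python
import math

def sigmoid_scalar(x, derivative=False):
    """Scalar stand-in for the NumPy sigmoid above (hypothetical helper).

    With derivative=True, x must already be a sigmoid output s,
    because the slope is computed as s * (1 - s).
    """
    return x * (1 - x) if derivative else 1 / (1 + math.exp(-x))

s = sigmoid_scalar(0.0)                      # sigmoid(0) = 0.5
slope = sigmoid_scalar(s, derivative=True)   # 0.5 * (1 - 0.5) = 0.25

# Cross-check the slope at x = 0 with a central finite difference.
eps = 1e-6
numeric = (sigmoid_scalar(eps) - sigmoid_scalar(-eps)) / (2 * eps)
print(s, slope, round(numeric, 6))
```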

In [12]:
sigmoid(np.max(np.array(rewards))), sigmoid(np.min(np.array(rewards)))

(0.9202328286156476, 0.09392780759037227)

In [13]:
print('rewards:', np.max(np.array(rewards))/100, np.min(np.array(rewards))/100)

rewards: 0.024455146789550783 -0.02266592502593994


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

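$$
Q(s, a) = r + \gamma \max_{a'}{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount factor, $s'$ is the next state reached from state $s$ by taking action $a$, and the maximum is taken over the actions $a'$ available in $s'$.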
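
To make the equation concrete, here is a minimal sketch that computes the right-hand side for one transition. The function name `bellman_target` and the Q-values are made up for illustration; `gamma = 0.99` matches the discount used later in this notebook, and the `done` flag reflects the usual convention that a terminal state has no future value.

```python
def bellman_target(reward, next_q_values, gamma=0.99, done=False):
    """Return r + gamma * max_a' Q(s', a'), or just r at a terminal state."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Hypothetical Q-values for the two Cart-Pole actions in the next state s'.
next_q = [1.5, 2.0]

target = bellman_target(1.0, next_q)                  # 1.0 + 0.99 * 2.0
terminal_target = bellman_target(1.0, next_q, done=True)  # episode ended: target is r
print(target, terminal_target)
```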

Previously, we used this equation to learn values for a Q-_table_. For this game, however, there is a huge number of possible states. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued, so, ignoring floating-point precision, there are effectively infinitely many states. Instead of a table, we'll use a neural network that approximates the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now the Q-values are calculated by passing a state into the network, which has fully connected hidden layers. The output is one Q-value, $Q(s, a)$, for each available action.

<img src="assets/q-network.png" width=550px>


As shown above, we can define our training targets as $\hat{Q}(s,a) = r + \gamma \max_{a'}{Q(s', a')}$. We then update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.
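
Here is a rough sketch of that update loop with the network swapped out for something much simpler. I'm assuming a linear Q-function (`q_value`) and a hand-written gradient step (`td_step`), both hypothetical names, just to show that one step toward a fixed target $\hat{Q}$ shrinks the squared error. The real notebook lets TensorFlow's optimizer do this.

```python
def q_value(weights, features):
    """Linear stand-in for the Q-network: Q = w . x (hypothetical model)."""
    return sum(w * x for w, x in zip(weights, features))

def td_step(weights, features, target, lr=0.1):
    """One gradient-descent step on (target - Q)^2, holding the target fixed."""
    error = target - q_value(weights, features)
    # d/dw (target - w.x)^2 = -2 * error * x, so descending adds 2 * lr * error * x
    return [w + lr * 2 * error * x for w, x in zip(weights, features)]

weights = [0.0, 0.0]
features = [1.0, 0.5]   # made-up encoding of one (s, a) pair
target = 2.98           # r + gamma * max Q(s', a'), treated as a constant

loss_before = (target - q_value(weights, features)) ** 2
weights = td_step(weights, features, target)
loss_after = (target - q_value(weights, features)) ** 2
print(loss_before, loss_after)
```

Treating the target as a constant during the step is the key design choice: the gradient only flows through $Q(s, a)$, not through $\hat{Q}$.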

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [35]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.float32, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    return states, actions, targetQs

In [36]:
# How to use batch-norm
#   x_norm = tf.layers.batch_normalization(x, training=training)

#   # ...

#   update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
#   with tf.control_dependencies(update_ops):
#     train_op = optimizer.minimize(loss)

In [37]:
# training: Either a Python boolean, or a TensorFlow boolean scalar tensor (e.g. a placeholder). 
# Whether to return the output in: 
# training mode (normalized with statistics of the current batch) or 
# inference mode (normalized with moving statistics). 
# NOTE: make sure to set this parameter correctly, or else your training/inference will not work properly.
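
To see why the `training` flag matters, here is a toy batch norm for a single feature in plain Python. This is a sketch, not TensorFlow's implementation: the class `BatchNorm1D` is hypothetical, it has no learned scale or shift, and the `momentum` and `eps` values are assumptions chosen to resemble typical defaults.

```python
import statistics

class BatchNorm1D:
    """Toy batch norm for a single feature (no learned scale/shift)."""

    def __init__(self, momentum=0.99, eps=1e-3):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Training mode: normalize with this batch's own statistics...
            mean = statistics.mean(batch)
            var = statistics.pvariance(batch, mu=mean)
            # ...and update the moving averages used later at inference.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Inference mode: normalize with the accumulated moving statistics.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]

bn = BatchNorm1D()
batch = [10.0, 12.0, 14.0]
train_out = bn(batch, training=True)    # centered on the batch mean (12.0)
infer_out = bn(batch, training=False)   # moving mean is still close to 0
print(train_out)
print(infer_out)
```

The same input gives very different outputs in the two modes until the moving statistics have caught up, which is why setting the flag wrongly can quietly break training or inference.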

In [38]:
# Generator/Controller: Generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits
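
The `tf.maximum(alpha * x, x)` lines in the network are a leaky ReLU written by hand. A minimal scalar sketch of the same expression (the function name `leaky_relu` is mine, and it assumes `0 <= alpha <= 1`):

```python
def leaky_relu(x, alpha=0.1):
    """max(alpha * x, x): identity for x >= 0, slope alpha for x < 0."""
    return max(alpha * x, x)

outputs = [leaky_relu(x) for x in (-2.0, -0.5, 0.0, 3.0)]
print(outputs)
```

Negative inputs are scaled by `alpha` instead of being zeroed, so some gradient still flows through units that would be "dead" under a plain ReLU.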

In [39]:
# Discriminator/Dopamine: Reward function/planner/navigator/advisor/supervisor/cortical columns
def discriminator(states, actions, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusion/merge states and actions/ SA/ SM
        x_fused = tf.concat(axis=1, values=[states, actions])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        #predictions = tf.nn.softmax(logits)

        # return rewards logits
        return logits

In [40]:
batch[1][1], batch[2][1], np.square(batch[1][1] - batch[2][1])

(array([ 0.33061767,  0.5409548 ,  0.83249   , -0.9815165 ], dtype=float32),
 array([-0.5404586 , -0.54526246,  0.5219986 ,  0.6396351 ], dtype=float32),
 array([0.7587739 , 1.179868  , 0.09640493, 2.6281323 ], dtype=float32))

In [43]:
def model_loss(action_size, hidden_size, states, actions, targetQs):
    # G
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    #actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    actions_labels = tf.nn.sigmoid(actions)
    # neg_log_prob_actions = tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
    #                                                                   labels=actions_labels)
    neg_log_prob_actions = tf.nn.sigmoid_cross_entropy_with_logits(logits=actions_logits, 
                                                                   labels=actions_labels)
    #g_loss = tf.reduce_mean(neg_log_prob_actions * targetQs) # error!
    
    # D
    Qs = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states)
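    # Flatten Qs from [batch, 1] to [batch] so it lines up with targetQs
    # instead of broadcasting to a [batch, batch] matrix
    d_loss = tf.reduce_mean(tf.square(tf.reshape(Qs, [-1]) - targetQs))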
    g_loss = tf.reduce_mean(neg_log_prob_actions * Qs)
    return actions_logits, Qs, g_loss, d_loss

In [44]:
# Optimizing/training/learning G & D
def model_opt(g_loss, d_loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return g_opt, d_opt

In [45]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs = model_input(state_size=state_size, action_size=action_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.Qs_logits, self.g_loss, self.d_loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, targetQs=self.targetQs) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss, d_loss=self.d_loss, learning_rate=learning_rate)

In [46]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
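
The class above only stores transitions. If you want to train on random mini-batches rather than the whole buffer, one possible extension looks like the sketch below. The names `ReplayMemory`, `add`, and `sample` are my own assumptions, and the transitions hold placeholder values.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer; the oldest transitions are dropped when full."""

    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement, capped at the buffer length.
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

memory_demo = ReplayMemory(max_size=3)
for step in range(5):
    # (state, action, next_state, reward, done) with placeholder values
    memory_demo.add((step, 0, step + 1, 1.0, 0.0))

print(len(memory_demo.buffer))   # transitions 0 and 1 were evicted
print(memory_demo.sample(batch_size=2))
```

Sampling uniformly breaks up the correlation between consecutive steps of an episode, which is the usual motivation for a replay buffer.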

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.

In [47]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(1000, 24) actions:(1000, 4)
action size:371.38470458984375


In [48]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
# state_size = 37
# state_size_ = (84, 84, 3)
state_size = 24
action_size = 4
hidden_size = 24*2             # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
gamma = 0.99                   # future reward discount
memory_size = 1000            # memory capacity
batch_size = 1000             # experience mini-batch size
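
To get a feel for the exploration parameters, this sketch evaluates the same decay formula the training loop uses, with the values from the cell above. The function name `explore_probability` is mine, and `math.exp` stands in for `np.exp` so the snippet runs without NumPy.

```python
import math

explore_start = 1.0
explore_stop = 0.01
decay_rate = 0.0001

def explore_probability(total_step):
    """Exponentially decay from explore_start toward explore_stop."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * total_step)

for step in (0, 1000, 10000, 100000):
    print(step, round(explore_probability(step), 4))
```

With `decay_rate = 0.0001`, the agent is still exploring more than a third of the time after 10,000 steps, and the probability never drops below `explore_stop`.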

In [49]:
# Reset/init the graph/session
tf.reset_default_graph()
graph = tf.get_default_graph()  # keep a handle to the fresh graph for the session

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [50]:
state = env.reset()
for _ in range(memory_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done:
        state = env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, gloss_list, dloss_list = [], [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        gloss_batch, dloss_batch = [], []
        state = env.reset()

        # Training steps/batches
        while True:
            # Explore (Env) or Exploit (Model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                action = env.action_space.sample()
            else:
                action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
                action = np.reshape(action_logits, [-1]) # For continuous action space
                #action = np.argmax(action_logits) # For discrete action space
#             print(action.shape)
            next_state, reward, done, _ = env.step(action)
            #print(action.shape, next_state.shape, reward, done)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            total_reward += reward
            state = next_state

            # Training
            #batch = memory.sample(batch_size)
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
#             print(actions.shape)
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            nextQs_logits = sess.run(model.Qs_logits, feed_dict = {model.states: next_states})
            #nextQs = np.max(nextQs_logits, axis=1) * (1-dones)
            nextQs = nextQs_logits.reshape([-1]) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
#             print(targetQs.shape)
            gloss, dloss, _, _ = sess.run([model.g_loss, model.d_loss, model.g_opt, model.d_opt],
                                            feed_dict = {model.states: states, 
                                                         model.actions: actions,
                                                         model.targetQs: targetQs})
            gloss_batch.append(gloss)
            dloss_batch.append(dloss)
            if done:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)),
              'exploreP:{:.4f}'.format(explore_p))
        # Plotting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) <= -150:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model2.ckpt')

Episode:0 meanR:-114.8592 R:-114.85924937840878 gloss:-0.1710 dloss:10.1072 exploreP:0.9917
Episode:1 meanR:-108.6199 R:-102.38057620347281 gloss:-168.8777 dloss:9.7398 exploreP:0.8466
Episode:2 meanR:-106.4612 R:-102.14365668013754 gloss:-192.7809 dloss:0.2265 exploreP:0.8423
Episode:3 meanR:-105.9194 R:-104.29425493636495 gloss:-322.3089 dloss:8.8371 exploreP:0.8379
Episode:4 meanR:-104.6380 R:-99.5121267642888 gloss:-530.3848 dloss:16.1615 exploreP:0.8299
Episode:5 meanR:-108.2965 R:-126.58929302986353 gloss:-898.6055 dloss:22.1505 exploreP:0.8124
Episode:6 meanR:-107.4127 R:-102.10977879420327 gloss:-1227.1403 dloss:26.8314 exploreP:0.8049
Episode:7 meanR:-106.5047 R:-100.14898194273127 gloss:-1560.2084 dloss:30.1961 exploreP:0.7985
Episode:8 meanR:-105.8458 R:-100.57420357320645 gloss:-1874.5610 dloss:32.5871 exploreP:0.7924
Episode:9 meanR:-106.1337 R:-108.72536417224816 gloss:-2695.7964 dloss:32.8745 exploreP:0.7785
Episode:10 meanR:-113.9604 R:-192.22689733556513 gloss:-3952.46

Episode:86 meanR:-120.4995 R:-116.07647661913839 gloss:-10345.1719 dloss:19.0645 exploreP:0.0679
Episode:87 meanR:-120.4476 R:-115.92631605460558 gloss:-10892.2275 dloss:19.0780 exploreP:0.0672
Episode:88 meanR:-120.3965 R:-115.89721876847497 gloss:-11938.5029 dloss:19.1622 exploreP:0.0666
Episode:89 meanR:-120.3703 R:-118.04298236798371 gloss:-13254.1963 dloss:19.2267 exploreP:0.0663
Episode:90 meanR:-120.3513 R:-118.6430174836839 gloss:-13442.5225 dloss:19.6240 exploreP:0.0656
Episode:91 meanR:-120.3287 R:-118.26627160261074 gloss:-13646.7520 dloss:18.5089 exploreP:0.0649
Episode:92 meanR:-120.3529 R:-122.58163401370558 gloss:-14017.4375 dloss:21.3557 exploreP:0.0647
Episode:93 meanR:-120.3057 R:-115.91594035798437 gloss:-12668.5947 dloss:18.4383 exploreP:0.0640
Episode:94 meanR:-120.2591 R:-115.87749728827708 gloss:-11431.5420 dloss:18.5395 exploreP:0.0632
Episode:95 meanR:-120.3125 R:-125.39153566696639 gloss:-11428.4512 dloss:23.0321 exploreP:0.0619
Episode:96 meanR:-120.2935 R:-1

Episode:171 meanR:-121.8335 R:-123.56139495399906 gloss:-11883.5381 dloss:5.0000 exploreP:0.0441
Episode:172 meanR:-121.7854 R:-116.10234185767112 gloss:-12110.9902 dloss:5.8873 exploreP:0.0439
Episode:173 meanR:-121.7805 R:-123.62642156710662 gloss:-11697.9814 dloss:10.8934 exploreP:0.0437
Episode:174 meanR:-121.8421 R:-123.62452777752156 gloss:-11267.7158 dloss:8.0230 exploreP:0.0436
Episode:175 meanR:-121.9222 R:-123.63047219251717 gloss:-10312.0703 dloss:7.7431 exploreP:0.0434
Episode:176 meanR:-121.9337 R:-123.50465409247703 gloss:-9557.1895 dloss:7.5839 exploreP:0.0433
Episode:177 meanR:-121.9862 R:-123.63631939356155 gloss:-8970.1553 dloss:7.4371 exploreP:0.0432
Episode:178 meanR:-122.0563 R:-123.54884225122011 gloss:-8687.5605 dloss:7.2992 exploreP:0.0430
Episode:179 meanR:-122.1283 R:-123.41297956774696 gloss:-8896.3486 dloss:7.2485 exploreP:0.0429
Episode:180 meanR:-122.1889 R:-123.53984174150663 gloss:-9086.1426 dloss:7.1833 exploreP:0.0428
Episode:181 meanR:-122.2889 R:-122

Episode:257 meanR:-123.2238 R:-178.7948789716949 gloss:-16615.2617 dloss:11.4533 exploreP:0.0292
Episode:258 meanR:-123.7839 R:-179.10370512496573 gloss:-1005.1916 dloss:1.4612 exploreP:0.0264
Episode:259 meanR:-124.3543 R:-179.98603745237307 gloss:-747.8523 dloss:0.4610 exploreP:0.0240
Episode:260 meanR:-124.9155 R:-179.54149274663033 gloss:-753.5580 dloss:1.2297 exploreP:0.0219
Episode:261 meanR:-125.4804 R:-180.04277638339596 gloss:-719.3569 dloss:0.5454 exploreP:0.0202
Episode:262 meanR:-126.0517 R:-180.57724750432695 gloss:-512.6987 dloss:0.4104 exploreP:0.0186


## Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 
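
Here is the same cumulative-sum trick restated in pure Python on a tiny input, as a sketch to show what `running_mean` returns. The name `running_mean_py` is hypothetical, and it uses `itertools.accumulate` in place of `np.cumsum`.

```python
from itertools import accumulate

def running_mean_py(x, N):
    """Pure-Python version of the cumulative-sum trick used above."""
    cumsum = [0] + list(accumulate(x))
    # cumsum[i + N] - cumsum[i] is the sum of the window x[i:i + N].
    return [(cumsum[i + N] - cumsum[i]) / N for i in range(len(x) - N + 1)]

rewards_demo = [1, 2, 3, 4, 5]
smoothed_demo = running_mean_py(rewards_demo, 2)
print(smoothed_demo)                            # [1.5, 2.5, 3.5, 4.5]
print(len(rewards_demo) - len(smoothed_demo))   # 1, i.e. N - 1 points are lost
```

The smoothed series is `N - 1` points shorter than the input, which is why the plotting cells slice the episode axis with `eps[-len(smoothed_arr):]`.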

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(gloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(dloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

## Testing

Let's check out how our trained agent plays the game.

In [36]:
import gym
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Episodes/epochs
    for _ in range(1):
        state = env.reset()
        total_reward = 0

        # Steps/batches
        while True:
            env.render()
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                print('total_reward: {}'.format(total_reward))
                break
                
env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward: 500.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)
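
When you swap in convolutional layers, the first thing to work out is how the image shrinks from layer to layer. The sketch below does that arithmetic for unpadded convolutions. The 84-pixel input and the three (kernel, stride) pairs are assumptions based on my reading of the paper linked below, so check them against the paper before relying on them.

```python
def conv_output_size(size, kernel, stride):
    """Output width of a convolution with no padding ('valid')."""
    return (size - kernel) // stride + 1

size = 84  # assumed width/height of a preprocessed frame
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)
    print('kernel', kernel, 'stride', stride, '->', size)
```

The final feature map is what gets flattened and passed to the fully connected layers that output one Q-value per action.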

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.