# Sequential Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import gym
import numpy as np

In [2]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

>**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [3]:
import gym

# Create the Cart-Pole game environment
#env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [4]:
env.reset()
for _ in range(10):
    # env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

state, action, reward, done, info: [ 0.01929091 -0.17009285 -0.00067668  0.25679981] 0 1.0 False {}
state, action, reward, done, info: [ 0.01588905  0.02503876  0.00445931 -0.03609647] 1 1.0 False {}
state, action, reward, done, info: [ 0.01638983  0.22009648  0.00373738 -0.32736911] 1 1.0 False {}
state, action, reward, done, info: [ 0.02079176  0.02492152 -0.00281    -0.03350993] 0 1.0 False {}
state, action, reward, done, info: [ 0.02129019  0.22008365 -0.0034802  -0.32707811] 1 1.0 False {}
state, action, reward, done, info: [ 0.02569186  0.41525498 -0.01002176 -0.6208565 ] 1 1.0 False {}
state, action, reward, done, info: [ 0.03399696  0.61051545 -0.02243889 -0.91667884] 1 1.0 False {}
state, action, reward, done, info: [ 0.04620727  0.8059335  -0.04077247 -1.21632861] 1 1.0 False {}
state, action, reward, done, info: [ 0.06232594  1.00155693 -0.06509904 -1.52150367] 1 1.0 False {}
state, action, reward, done, info: [ 0.08235708  1.19740229 -0.09552911 -1.83377506] 1 1.0 False {}


To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [5]:
# print(rewards[-20:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [8]:
def model_input(state_size, lstm_size, batch_size=1):
    actions = tf.placeholder(tf.int32, [None], name='actions')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
        
    gru = tf.nn.rnn_cell.GRUCell(lstm_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    initial_state = cell.zero_state(batch_size, tf.float32)
    
    return actions, states, targetQs, cell, initial_state

In [9]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [10]:
def model_loss(action_size, hidden_size, states, cell, initial_state, actions, targetQs):
    actions_logits, final_state = generator(states=states, cell=cell, initial_state=initial_state, 
                                            lstm_size=hidden_size, num_classes=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    loss = tf.reduce_mean(tf.square(Qs - targetQs)) # next state, next action and nextQs
    #lossQlbl = tf.reduce_mean(tf.square(Qs - labelQs)) # current state, action, and currentQs
    # lossQtgt_sigm = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs, 
    #                                                                        labels=tf.nn.sigmoid(targetQs)))
    # lossQlbl_sigm = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs,
    #                                                                        labels=tf.nn.sigmoid(labelQs)))
    #
    #loss = lossQtgt + lossQlbl #+ lossQtgt_sigm + lossQlbl_sigm
    return actions_logits, final_state, loss #, lossQtgt, lossQlbl, lossQtgt_sigm, lossQlbl_sigm

In [11]:
def model_opt(loss, learning_rate):
    """
    Get optimization operations in order
    :param loss: Generator loss Tensor for action prediction
    :param learning_rate: Learning Rate Placeholder
    :return: A tuple of (qfunction training, generator training, discriminator training)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    grads = tf.gradients(loss, g_vars)
    opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt

In [12]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.actions, self.states, self.targetQs, cell, self.initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cell=cell, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [13]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [14]:
# print('state:', np.array(states).shape[1], 
#       'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

In [17]:
# Training parameters
batch_size = 500               # number of samples in the memory/ experience as mini-batch size
learning_rate = 0.001          # learning rate for adam

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
action_size = 2                # number of units for the output actions -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation

In [18]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 2)


## Populate the memory (exprience memory)

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [19]:
state = env.reset()
for _ in range(batch_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done is True:
        state = env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [20]:
memory.buffer[0]

[array([ 0.03727844, -0.00024793, -0.00543765, -0.04855665]),
 1,
 array([ 0.03727348,  0.19495156, -0.00640878, -0.34295023]),
 1.0,
 0.0]

In [21]:
# states, rewards, actions

In [None]:
from collections import deque
episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
saver = tf.train.Saver()
rewards_list, loss_list = [], []

# TF session for training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    
    # Training episodes/epochs
    for ep in range(111111111111111):
        state = env.reset()
        total_reward = 0
        loss_batch = []
        initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]

        # Training steps/batches
        while True:
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            # Explore (Env) or Exploit (Model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                action = env.action_space.sample()
            else:
                action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append([initial_state, final_state])
            total_reward += reward
            initial_state = final_state
            state = next_state

            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            rnn_states = memory.states
            initial_states = np.array([each[0] for each in rnn_states])
            final_states = np.array([each[1] for each in rnn_states])
            next_actions_logits = sess.run(model.actions_logits, 
                                           feed_dict = {model.states: next_states, 
                                                        model.initial_state: final_states[0].reshape([1, -1])})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones) # exploit
            targetQs = rewards + (0.99 * nextQs)
            loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                     model.actions: actions,
                                                                     model.targetQs: targetQs,
                                                        model.initial_state: initial_states[0].reshape([1, -1])})
            loss_batch.append(loss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode: {}'.format(ep),
              'meanReward: {:.4f}'.format(np.mean(episode_reward)),
              'meanLoss: {:.4f}'.format(np.mean(loss_batch)))
        rewards_list.append([ep, np.mean(episode_reward)])
        loss_list.append([ep, np.mean(loss_batch)])
        if(np.mean(episode_reward) >= 500):
            break
    
    saver.save(sess, 'checkpoints/model5.ckpt')

Episode: 0 meanReward: 27.0000 meanLoss: 2.7003
Episode: 1 meanReward: 23.0000 meanLoss: 3.8974
Episode: 2 meanReward: 25.3333 meanLoss: 4.1154
Episode: 3 meanReward: 22.2500 meanLoss: 5.2078
Episode: 4 meanReward: 23.4000 meanLoss: 6.2932
Episode: 5 meanReward: 25.3333 meanLoss: 7.3700
Episode: 6 meanReward: 29.4286 meanLoss: 11.2740
Episode: 7 meanReward: 27.8750 meanLoss: 6.5143
Episode: 8 meanReward: 27.7778 meanLoss: 6.5436
Episode: 9 meanReward: 27.4000 meanLoss: 7.0703
Episode: 10 meanReward: 27.2727 meanLoss: 7.6783
Episode: 11 meanReward: 26.0833 meanLoss: 8.2082
Episode: 12 meanReward: 27.0769 meanLoss: 7.7172
Episode: 13 meanReward: 26.5714 meanLoss: 8.3219
Episode: 14 meanReward: 26.6667 meanLoss: 8.5452
Episode: 15 meanReward: 26.2500 meanLoss: 8.8593
Episode: 16 meanReward: 25.8235 meanLoss: 9.1318
Episode: 17 meanReward: 24.8889 meanLoss: 8.8313
Episode: 18 meanReward: 24.3158 meanLoss: 8.8264
Episode: 19 meanReward: 24.3500 meanLoss: 8.1634
Episode: 20 meanReward: 24.00

Episode: 164 meanReward: 37.4300 meanLoss: 10.9025
Episode: 165 meanReward: 37.4300 meanLoss: 11.1779
Episode: 166 meanReward: 37.3600 meanLoss: 23.4601
Episode: 167 meanReward: 37.0700 meanLoss: 42.1241
Episode: 168 meanReward: 37.0100 meanLoss: 39.7930
Episode: 169 meanReward: 37.1900 meanLoss: 115.7931
Episode: 170 meanReward: 37.1600 meanLoss: 56.7702
Episode: 171 meanReward: 37.0200 meanLoss: 47.4981
Episode: 172 meanReward: 37.1300 meanLoss: 29.6839
Episode: 173 meanReward: 36.8200 meanLoss: 23.9989
Episode: 174 meanReward: 37.2400 meanLoss: 28.0243
Episode: 175 meanReward: 37.1500 meanLoss: 44.2890
Episode: 176 meanReward: 37.6200 meanLoss: 106.6879
Episode: 177 meanReward: 37.4200 meanLoss: 129.5124
Episode: 178 meanReward: 37.4300 meanLoss: 88.4397
Episode: 179 meanReward: 37.5800 meanLoss: 82.8375
Episode: 180 meanReward: 37.8000 meanLoss: 49.3479
Episode: 181 meanReward: 37.9800 meanLoss: 33.4321
Episode: 182 meanReward: 38.6800 meanLoss: 19.9368
Episode: 183 meanReward: 39.

Episode: 323 meanReward: 253.9400 meanLoss: 29.2325
Episode: 324 meanReward: 258.5500 meanLoss: 5.3311
Episode: 325 meanReward: 262.7400 meanLoss: 18.8793
Episode: 326 meanReward: 266.8900 meanLoss: 17.3563
Episode: 327 meanReward: 267.3800 meanLoss: 25.6535
Episode: 328 meanReward: 271.6200 meanLoss: 21.7112
Episode: 329 meanReward: 276.4600 meanLoss: 10.8464
Episode: 330 meanReward: 275.5400 meanLoss: 17.7698
Episode: 331 meanReward: 274.5900 meanLoss: 27.6458
Episode: 332 meanReward: 271.4800 meanLoss: 55.1073
Episode: 333 meanReward: 269.7100 meanLoss: 42.3936
Episode: 334 meanReward: 270.1200 meanLoss: 16.6931
Episode: 335 meanReward: 274.5300 meanLoss: 3.5721
Episode: 336 meanReward: 275.3700 meanLoss: 21.6141
Episode: 337 meanReward: 279.6000 meanLoss: 17.0073
Episode: 338 meanReward: 280.3600 meanLoss: 27.1521
Episode: 339 meanReward: 277.9800 meanLoss: 50.9190
Episode: 340 meanReward: 277.0200 meanLoss: 31.9513
Episode: 341 meanReward: 279.7700 meanLoss: 2.2509
Episode: 342 me

Episode: 482 meanReward: 228.9400 meanLoss: 42.5381
Episode: 483 meanReward: 230.6800 meanLoss: 40.0722
Episode: 484 meanReward: 229.3600 meanLoss: 8.8449
Episode: 485 meanReward: 229.3600 meanLoss: 4.5405
Episode: 486 meanReward: 224.7700 meanLoss: 88.5103
Episode: 487 meanReward: 222.8000 meanLoss: 134.0231
Episode: 488 meanReward: 222.1800 meanLoss: 114.9389
Episode: 489 meanReward: 225.5800 meanLoss: 94.1833
Episode: 490 meanReward: 225.7200 meanLoss: 62.7692
Episode: 491 meanReward: 225.7200 meanLoss: 56.3123
Episode: 492 meanReward: 222.6700 meanLoss: 52.8411
Episode: 493 meanReward: 223.5700 meanLoss: 67.1539
Episode: 494 meanReward: 222.1800 meanLoss: 82.4084
Episode: 495 meanReward: 223.0700 meanLoss: 81.9304
Episode: 496 meanReward: 223.1900 meanLoss: 44.4779
Episode: 497 meanReward: 224.2300 meanLoss: 38.6542
Episode: 498 meanReward: 227.3900 meanLoss: 22.3783
Episode: 499 meanReward: 228.5700 meanLoss: 16.0243
Episode: 500 meanReward: 228.7300 meanLoss: 14.0791
Episode: 501

Episode: 640 meanReward: 287.5800 meanLoss: 22.1436
Episode: 641 meanReward: 287.8100 meanLoss: 24.6294
Episode: 642 meanReward: 284.8000 meanLoss: 22.8247
Episode: 643 meanReward: 289.3700 meanLoss: 12.9288
Episode: 644 meanReward: 284.5100 meanLoss: 17.6906
Episode: 645 meanReward: 281.9100 meanLoss: 32.7379
Episode: 646 meanReward: 277.1300 meanLoss: 42.4118
Episode: 647 meanReward: 272.2900 meanLoss: 45.6113
Episode: 648 meanReward: 267.4500 meanLoss: 46.4828
Episode: 649 meanReward: 262.9300 meanLoss: 51.0276
Episode: 650 meanReward: 262.7000 meanLoss: 30.4103
Episode: 651 meanReward: 266.6500 meanLoss: 11.5001
Episode: 652 meanReward: 270.6300 meanLoss: 31.9112
Episode: 653 meanReward: 273.5900 meanLoss: 30.6991
Episode: 654 meanReward: 273.5900 meanLoss: 18.1860
Episode: 655 meanReward: 271.5200 meanLoss: 24.9300
Episode: 656 meanReward: 276.3600 meanLoss: 17.3187
Episode: 657 meanReward: 279.7500 meanLoss: 27.7667
Episode: 658 meanReward: 279.7900 meanLoss: 36.8171
Episode: 659

Episode: 798 meanReward: 241.7700 meanLoss: 44.3381
Episode: 799 meanReward: 242.4000 meanLoss: 22.5120
Episode: 800 meanReward: 242.4200 meanLoss: 28.0230
Episode: 801 meanReward: 245.0700 meanLoss: 26.5374
Episode: 802 meanReward: 244.2000 meanLoss: 9.0008
Episode: 803 meanReward: 243.5200 meanLoss: 24.7019
Episode: 804 meanReward: 241.9400 meanLoss: 50.8751
Episode: 805 meanReward: 239.1600 meanLoss: 32.3422
Episode: 806 meanReward: 236.7800 meanLoss: 16.0865
Episode: 807 meanReward: 236.7800 meanLoss: 6.2293
Episode: 808 meanReward: 236.7800 meanLoss: 18.2261
Episode: 809 meanReward: 239.7900 meanLoss: 12.1835
Episode: 810 meanReward: 239.0600 meanLoss: 24.9434
Episode: 811 meanReward: 241.0700 meanLoss: 13.5874
Episode: 812 meanReward: 240.9500 meanLoss: 4.0647
Episode: 813 meanReward: 242.1300 meanLoss: 13.0022
Episode: 814 meanReward: 242.2700 meanLoss: 21.4777
Episode: 815 meanReward: 242.2000 meanLoss: 29.7519
Episode: 816 meanReward: 241.0500 meanLoss: 21.5948
Episode: 817 me

Episode: 956 meanReward: 290.6000 meanLoss: 17.3262
Episode: 957 meanReward: 288.5700 meanLoss: 6.0848
Episode: 958 meanReward: 291.7300 meanLoss: 5.0737
Episode: 959 meanReward: 289.1300 meanLoss: 41.4692
Episode: 960 meanReward: 285.0100 meanLoss: 56.1620
Episode: 961 meanReward: 287.0600 meanLoss: 56.2974
Episode: 962 meanReward: 284.4200 meanLoss: 39.3163
Episode: 963 meanReward: 281.7300 meanLoss: 62.1218
Episode: 964 meanReward: 280.0100 meanLoss: 70.8101
Episode: 965 meanReward: 278.7500 meanLoss: 121.9094
Episode: 966 meanReward: 277.4300 meanLoss: 123.2042
Episode: 967 meanReward: 277.4100 meanLoss: 114.8508
Episode: 968 meanReward: 277.3700 meanLoss: 116.5893
Episode: 969 meanReward: 277.3700 meanLoss: 111.5237
Episode: 970 meanReward: 275.1300 meanLoss: 120.9518
Episode: 971 meanReward: 270.2800 meanLoss: 129.8226
Episode: 972 meanReward: 267.7400 meanLoss: 117.8223
Episode: 973 meanReward: 265.4500 meanLoss: 105.8777
Episode: 974 meanReward: 262.3100 meanLoss: 107.7443
Epis

Episode: 1123 meanReward: 9.2700 meanLoss: nan
Episode: 1124 meanReward: 9.2900 meanLoss: nan
Episode: 1125 meanReward: 9.3000 meanLoss: nan
Episode: 1126 meanReward: 9.3100 meanLoss: nan
Episode: 1127 meanReward: 9.3100 meanLoss: nan
Episode: 1128 meanReward: 9.3200 meanLoss: nan
Episode: 1129 meanReward: 9.3100 meanLoss: nan
Episode: 1130 meanReward: 9.3100 meanLoss: nan
Episode: 1131 meanReward: 9.3200 meanLoss: nan
Episode: 1132 meanReward: 9.3300 meanLoss: nan
Episode: 1133 meanReward: 9.3300 meanLoss: nan
Episode: 1134 meanReward: 9.3200 meanLoss: nan
Episode: 1135 meanReward: 9.3300 meanLoss: nan
Episode: 1136 meanReward: 9.3400 meanLoss: nan
Episode: 1137 meanReward: 9.3400 meanLoss: nan
Episode: 1138 meanReward: 9.3600 meanLoss: nan
Episode: 1139 meanReward: 9.3500 meanLoss: nan
Episode: 1140 meanReward: 9.3600 meanLoss: nan
Episode: 1141 meanReward: 9.3600 meanLoss: nan
Episode: 1142 meanReward: 9.3400 meanLoss: nan
Episode: 1143 meanReward: 9.3300 meanLoss: nan
Episode: 1144

Episode: 1298 meanReward: 9.4000 meanLoss: nan
Episode: 1299 meanReward: 9.3900 meanLoss: nan
Episode: 1300 meanReward: 9.4000 meanLoss: nan
Episode: 1301 meanReward: 9.4100 meanLoss: nan
Episode: 1302 meanReward: 9.4300 meanLoss: nan
Episode: 1303 meanReward: 9.4200 meanLoss: nan
Episode: 1304 meanReward: 9.4300 meanLoss: nan
Episode: 1305 meanReward: 9.4200 meanLoss: nan
Episode: 1306 meanReward: 9.4200 meanLoss: nan
Episode: 1307 meanReward: 9.4300 meanLoss: nan
Episode: 1308 meanReward: 9.4400 meanLoss: nan
Episode: 1309 meanReward: 9.4300 meanLoss: nan
Episode: 1310 meanReward: 9.4300 meanLoss: nan
Episode: 1311 meanReward: 9.4300 meanLoss: nan
Episode: 1312 meanReward: 9.4200 meanLoss: nan
Episode: 1313 meanReward: 9.4400 meanLoss: nan
Episode: 1314 meanReward: 9.4500 meanLoss: nan
Episode: 1315 meanReward: 9.4500 meanLoss: nan
Episode: 1316 meanReward: 9.4500 meanLoss: nan
Episode: 1317 meanReward: 9.4600 meanLoss: nan
Episode: 1318 meanReward: 9.4600 meanLoss: nan
Episode: 1319

Episode: 1473 meanReward: 9.4000 meanLoss: nan
Episode: 1474 meanReward: 9.4200 meanLoss: nan
Episode: 1475 meanReward: 9.4200 meanLoss: nan
Episode: 1476 meanReward: 9.4200 meanLoss: nan
Episode: 1477 meanReward: 9.4300 meanLoss: nan
Episode: 1478 meanReward: 9.4400 meanLoss: nan
Episode: 1479 meanReward: 9.4200 meanLoss: nan
Episode: 1480 meanReward: 9.4100 meanLoss: nan
Episode: 1481 meanReward: 9.4000 meanLoss: nan
Episode: 1482 meanReward: 9.4000 meanLoss: nan
Episode: 1483 meanReward: 9.4000 meanLoss: nan
Episode: 1484 meanReward: 9.3900 meanLoss: nan
Episode: 1485 meanReward: 9.3800 meanLoss: nan
Episode: 1486 meanReward: 9.3700 meanLoss: nan
Episode: 1487 meanReward: 9.3700 meanLoss: nan
Episode: 1488 meanReward: 9.3700 meanLoss: nan
Episode: 1489 meanReward: 9.3700 meanLoss: nan
Episode: 1490 meanReward: 9.3700 meanLoss: nan
Episode: 1491 meanReward: 9.3700 meanLoss: nan
Episode: 1492 meanReward: 9.3800 meanLoss: nan
Episode: 1493 meanReward: 9.4000 meanLoss: nan
Episode: 1494

Episode: 1648 meanReward: 9.3100 meanLoss: nan
Episode: 1649 meanReward: 9.3200 meanLoss: nan
Episode: 1650 meanReward: 9.3300 meanLoss: nan
Episode: 1651 meanReward: 9.3400 meanLoss: nan
Episode: 1652 meanReward: 9.3400 meanLoss: nan
Episode: 1653 meanReward: 9.3400 meanLoss: nan
Episode: 1654 meanReward: 9.3500 meanLoss: nan
Episode: 1655 meanReward: 9.3300 meanLoss: nan
Episode: 1656 meanReward: 9.3400 meanLoss: nan
Episode: 1657 meanReward: 9.3200 meanLoss: nan
Episode: 1658 meanReward: 9.3200 meanLoss: nan
Episode: 1659 meanReward: 9.3200 meanLoss: nan
Episode: 1660 meanReward: 9.3300 meanLoss: nan
Episode: 1661 meanReward: 9.3200 meanLoss: nan
Episode: 1662 meanReward: 9.3000 meanLoss: nan
Episode: 1663 meanReward: 9.3000 meanLoss: nan
Episode: 1664 meanReward: 9.3000 meanLoss: nan
Episode: 1665 meanReward: 9.2800 meanLoss: nan
Episode: 1666 meanReward: 9.3000 meanLoss: nan
Episode: 1667 meanReward: 9.3100 meanLoss: nan
Episode: 1668 meanReward: 9.3100 meanLoss: nan
Episode: 1669

Episode: 1823 meanReward: 9.3300 meanLoss: nan
Episode: 1824 meanReward: 9.3200 meanLoss: nan
Episode: 1825 meanReward: 9.3000 meanLoss: nan
Episode: 1826 meanReward: 9.2900 meanLoss: nan
Episode: 1827 meanReward: 9.2800 meanLoss: nan
Episode: 1828 meanReward: 9.2700 meanLoss: nan
Episode: 1829 meanReward: 9.2700 meanLoss: nan
Episode: 1830 meanReward: 9.2800 meanLoss: nan
Episode: 1831 meanReward: 9.2800 meanLoss: nan
Episode: 1832 meanReward: 9.3000 meanLoss: nan
Episode: 1833 meanReward: 9.3100 meanLoss: nan
Episode: 1834 meanReward: 9.3100 meanLoss: nan
Episode: 1835 meanReward: 9.3000 meanLoss: nan
Episode: 1836 meanReward: 9.3000 meanLoss: nan
Episode: 1837 meanReward: 9.3100 meanLoss: nan
Episode: 1838 meanReward: 9.3100 meanLoss: nan
Episode: 1839 meanReward: 9.3200 meanLoss: nan
Episode: 1840 meanReward: 9.3100 meanLoss: nan
Episode: 1841 meanReward: 9.3100 meanLoss: nan
Episode: 1842 meanReward: 9.2900 meanLoss: nan
Episode: 1843 meanReward: 9.2500 meanLoss: nan
Episode: 1844

Episode: 1998 meanReward: 9.4700 meanLoss: nan
Episode: 1999 meanReward: 9.4800 meanLoss: nan
Episode: 2000 meanReward: 9.4700 meanLoss: nan
Episode: 2001 meanReward: 9.4600 meanLoss: nan
Episode: 2002 meanReward: 9.4600 meanLoss: nan
Episode: 2003 meanReward: 9.4600 meanLoss: nan
Episode: 2004 meanReward: 9.4700 meanLoss: nan
Episode: 2005 meanReward: 9.4600 meanLoss: nan
Episode: 2006 meanReward: 9.4700 meanLoss: nan
Episode: 2007 meanReward: 9.4900 meanLoss: nan
Episode: 2008 meanReward: 9.4900 meanLoss: nan
Episode: 2009 meanReward: 9.5100 meanLoss: nan
Episode: 2010 meanReward: 9.5100 meanLoss: nan
Episode: 2011 meanReward: 9.5000 meanLoss: nan
Episode: 2012 meanReward: 9.4900 meanLoss: nan
Episode: 2013 meanReward: 9.4900 meanLoss: nan
Episode: 2014 meanReward: 9.4800 meanLoss: nan
Episode: 2015 meanReward: 9.4800 meanLoss: nan
Episode: 2016 meanReward: 9.4700 meanLoss: nan
Episode: 2017 meanReward: 9.4400 meanLoss: nan
Episode: 2018 meanReward: 9.4300 meanLoss: nan
Episode: 2019

Episode: 2173 meanReward: 9.4300 meanLoss: nan
Episode: 2174 meanReward: 9.4100 meanLoss: nan
Episode: 2175 meanReward: 9.4100 meanLoss: nan
Episode: 2176 meanReward: 9.4200 meanLoss: nan
Episode: 2177 meanReward: 9.4000 meanLoss: nan
Episode: 2178 meanReward: 9.4000 meanLoss: nan
Episode: 2179 meanReward: 9.4100 meanLoss: nan
Episode: 2180 meanReward: 9.3900 meanLoss: nan
Episode: 2181 meanReward: 9.3800 meanLoss: nan
Episode: 2182 meanReward: 9.3900 meanLoss: nan
Episode: 2183 meanReward: 9.3900 meanLoss: nan
Episode: 2184 meanReward: 9.3800 meanLoss: nan
Episode: 2185 meanReward: 9.3700 meanLoss: nan
Episode: 2186 meanReward: 9.3900 meanLoss: nan
Episode: 2187 meanReward: 9.3900 meanLoss: nan
Episode: 2188 meanReward: 9.3900 meanLoss: nan
Episode: 2189 meanReward: 9.3900 meanLoss: nan
Episode: 2190 meanReward: 9.3900 meanLoss: nan
Episode: 2191 meanReward: 9.3800 meanLoss: nan
Episode: 2192 meanReward: 9.3800 meanLoss: nan
Episode: 2193 meanReward: 9.3600 meanLoss: nan
Episode: 2194

Episode: 2348 meanReward: 9.3300 meanLoss: nan
Episode: 2349 meanReward: 9.3400 meanLoss: nan
Episode: 2350 meanReward: 9.3500 meanLoss: nan
Episode: 2351 meanReward: 9.3500 meanLoss: nan
Episode: 2352 meanReward: 9.3400 meanLoss: nan
Episode: 2353 meanReward: 9.3200 meanLoss: nan
Episode: 2354 meanReward: 9.3200 meanLoss: nan
Episode: 2355 meanReward: 9.3100 meanLoss: nan
Episode: 2356 meanReward: 9.3100 meanLoss: nan
Episode: 2357 meanReward: 9.3100 meanLoss: nan
Episode: 2358 meanReward: 9.3100 meanLoss: nan
Episode: 2359 meanReward: 9.3000 meanLoss: nan
Episode: 2360 meanReward: 9.3100 meanLoss: nan
Episode: 2361 meanReward: 9.3100 meanLoss: nan
Episode: 2362 meanReward: 9.3300 meanLoss: nan
Episode: 2363 meanReward: 9.3300 meanLoss: nan
Episode: 2364 meanReward: 9.3400 meanLoss: nan
Episode: 2365 meanReward: 9.3300 meanLoss: nan
Episode: 2366 meanReward: 9.3400 meanLoss: nan
Episode: 2367 meanReward: 9.3500 meanLoss: nan
Episode: 2368 meanReward: 9.3400 meanLoss: nan
Episode: 2369

Episode: 2523 meanReward: 9.3900 meanLoss: nan
Episode: 2524 meanReward: 9.3800 meanLoss: nan
Episode: 2525 meanReward: 9.3700 meanLoss: nan
Episode: 2526 meanReward: 9.3600 meanLoss: nan
Episode: 2527 meanReward: 9.3600 meanLoss: nan
Episode: 2528 meanReward: 9.3600 meanLoss: nan
Episode: 2529 meanReward: 9.3600 meanLoss: nan
Episode: 2530 meanReward: 9.3400 meanLoss: nan
Episode: 2531 meanReward: 9.3500 meanLoss: nan
Episode: 2532 meanReward: 9.3400 meanLoss: nan
Episode: 2533 meanReward: 9.3400 meanLoss: nan
Episode: 2534 meanReward: 9.3200 meanLoss: nan
Episode: 2535 meanReward: 9.3100 meanLoss: nan
Episode: 2536 meanReward: 9.3200 meanLoss: nan
Episode: 2537 meanReward: 9.3400 meanLoss: nan
Episode: 2538 meanReward: 9.3400 meanLoss: nan
Episode: 2539 meanReward: 9.3300 meanLoss: nan
Episode: 2540 meanReward: 9.3200 meanLoss: nan
Episode: 2541 meanReward: 9.3200 meanLoss: nan
Episode: 2542 meanReward: 9.3100 meanLoss: nan
Episode: 2543 meanReward: 9.3000 meanLoss: nan
Episode: 2544

Episode: 2698 meanReward: 9.3800 meanLoss: nan
Episode: 2699 meanReward: 9.3600 meanLoss: nan
Episode: 2700 meanReward: 9.3500 meanLoss: nan
Episode: 2701 meanReward: 9.3600 meanLoss: nan
Episode: 2702 meanReward: 9.3700 meanLoss: nan
Episode: 2703 meanReward: 9.3800 meanLoss: nan
Episode: 2704 meanReward: 9.3600 meanLoss: nan
Episode: 2705 meanReward: 9.3600 meanLoss: nan
Episode: 2706 meanReward: 9.3600 meanLoss: nan
Episode: 2707 meanReward: 9.3500 meanLoss: nan
Episode: 2708 meanReward: 9.3400 meanLoss: nan
Episode: 2709 meanReward: 9.3400 meanLoss: nan
Episode: 2710 meanReward: 9.3600 meanLoss: nan
Episode: 2711 meanReward: 9.3900 meanLoss: nan
Episode: 2712 meanReward: 9.3800 meanLoss: nan
Episode: 2713 meanReward: 9.3700 meanLoss: nan
Episode: 2714 meanReward: 9.3700 meanLoss: nan
Episode: 2715 meanReward: 9.3800 meanLoss: nan
Episode: 2716 meanReward: 9.3700 meanLoss: nan
Episode: 2717 meanReward: 9.3600 meanLoss: nan
Episode: 2718 meanReward: 9.3600 meanLoss: nan
Episode: 2719

Episode: 2873 meanReward: 9.3600 meanLoss: nan
Episode: 2874 meanReward: 9.3500 meanLoss: nan
Episode: 2875 meanReward: 9.3500 meanLoss: nan
Episode: 2876 meanReward: 9.3600 meanLoss: nan
Episode: 2877 meanReward: 9.3700 meanLoss: nan
Episode: 2878 meanReward: 9.3900 meanLoss: nan
Episode: 2879 meanReward: 9.3800 meanLoss: nan
Episode: 2880 meanReward: 9.3700 meanLoss: nan
Episode: 2881 meanReward: 9.3800 meanLoss: nan
Episode: 2882 meanReward: 9.3800 meanLoss: nan
Episode: 2883 meanReward: 9.3800 meanLoss: nan
Episode: 2884 meanReward: 9.3900 meanLoss: nan
Episode: 2885 meanReward: 9.4000 meanLoss: nan
Episode: 2886 meanReward: 9.4000 meanLoss: nan
Episode: 2887 meanReward: 9.4000 meanLoss: nan
Episode: 2888 meanReward: 9.3900 meanLoss: nan
Episode: 2889 meanReward: 9.3800 meanLoss: nan
Episode: 2890 meanReward: 9.3900 meanLoss: nan
Episode: 2891 meanReward: 9.4000 meanLoss: nan
Episode: 2892 meanReward: 9.3900 meanLoss: nan
Episode: 2893 meanReward: 9.3900 meanLoss: nan
Episode: 2894

Episode: 3048 meanReward: 9.2600 meanLoss: nan
Episode: 3049 meanReward: 9.2600 meanLoss: nan
Episode: 3050 meanReward: 9.2600 meanLoss: nan
Episode: 3051 meanReward: 9.2700 meanLoss: nan
Episode: 3052 meanReward: 9.2900 meanLoss: nan
Episode: 3053 meanReward: 9.3000 meanLoss: nan
Episode: 3054 meanReward: 9.3000 meanLoss: nan
Episode: 3055 meanReward: 9.2900 meanLoss: nan
Episode: 3056 meanReward: 9.3100 meanLoss: nan
Episode: 3057 meanReward: 9.2900 meanLoss: nan
Episode: 3058 meanReward: 9.2900 meanLoss: nan
Episode: 3059 meanReward: 9.3000 meanLoss: nan
Episode: 3060 meanReward: 9.2900 meanLoss: nan
Episode: 3061 meanReward: 9.3000 meanLoss: nan
Episode: 3062 meanReward: 9.3000 meanLoss: nan
Episode: 3063 meanReward: 9.3100 meanLoss: nan
Episode: 3064 meanReward: 9.2900 meanLoss: nan
Episode: 3065 meanReward: 9.2800 meanLoss: nan
Episode: 3066 meanReward: 9.2900 meanLoss: nan
Episode: 3067 meanReward: 9.2800 meanLoss: nan
Episode: 3068 meanReward: 9.2900 meanLoss: nan
Episode: 3069

Episode: 3223 meanReward: 9.2900 meanLoss: nan
Episode: 3224 meanReward: 9.2800 meanLoss: nan
Episode: 3225 meanReward: 9.2900 meanLoss: nan
Episode: 3226 meanReward: 9.2900 meanLoss: nan
Episode: 3227 meanReward: 9.2900 meanLoss: nan
Episode: 3228 meanReward: 9.2900 meanLoss: nan
Episode: 3229 meanReward: 9.2700 meanLoss: nan
Episode: 3230 meanReward: 9.2800 meanLoss: nan
Episode: 3231 meanReward: 9.2800 meanLoss: nan
Episode: 3232 meanReward: 9.2900 meanLoss: nan
Episode: 3233 meanReward: 9.2800 meanLoss: nan
Episode: 3234 meanReward: 9.2800 meanLoss: nan
Episode: 3235 meanReward: 9.2800 meanLoss: nan
Episode: 3236 meanReward: 9.2900 meanLoss: nan
Episode: 3237 meanReward: 9.2900 meanLoss: nan
Episode: 3238 meanReward: 9.2800 meanLoss: nan
Episode: 3239 meanReward: 9.2900 meanLoss: nan
Episode: 3240 meanReward: 9.2800 meanLoss: nan
Episode: 3241 meanReward: 9.3000 meanLoss: nan
Episode: 3242 meanReward: 9.2900 meanLoss: nan
Episode: 3243 meanReward: 9.2900 meanLoss: nan
Episode: 3244

Episode: 3398 meanReward: 9.3500 meanLoss: nan
Episode: 3399 meanReward: 9.3600 meanLoss: nan
Episode: 3400 meanReward: 9.3600 meanLoss: nan
Episode: 3401 meanReward: 9.3600 meanLoss: nan
Episode: 3402 meanReward: 9.3900 meanLoss: nan
Episode: 3403 meanReward: 9.3900 meanLoss: nan
Episode: 3404 meanReward: 9.3700 meanLoss: nan
Episode: 3405 meanReward: 9.3600 meanLoss: nan
Episode: 3406 meanReward: 9.3600 meanLoss: nan
Episode: 3407 meanReward: 9.3500 meanLoss: nan
Episode: 3408 meanReward: 9.3500 meanLoss: nan
Episode: 3409 meanReward: 9.3600 meanLoss: nan
Episode: 3410 meanReward: 9.3700 meanLoss: nan
Episode: 3411 meanReward: 9.3600 meanLoss: nan
Episode: 3412 meanReward: 9.3500 meanLoss: nan
Episode: 3413 meanReward: 9.3700 meanLoss: nan
Episode: 3414 meanReward: 9.3700 meanLoss: nan
Episode: 3415 meanReward: 9.3900 meanLoss: nan
Episode: 3416 meanReward: 9.3800 meanLoss: nan
Episode: 3417 meanReward: 9.3900 meanLoss: nan
Episode: 3418 meanReward: 9.3900 meanLoss: nan
Episode: 3419

Episode: 3573 meanReward: 9.4000 meanLoss: nan
Episode: 3574 meanReward: 9.4100 meanLoss: nan
Episode: 3575 meanReward: 9.4200 meanLoss: nan
Episode: 3576 meanReward: 9.4100 meanLoss: nan
Episode: 3577 meanReward: 9.4100 meanLoss: nan
Episode: 3578 meanReward: 9.4000 meanLoss: nan
Episode: 3579 meanReward: 9.3800 meanLoss: nan
Episode: 3580 meanReward: 9.3700 meanLoss: nan
Episode: 3581 meanReward: 9.3500 meanLoss: nan
Episode: 3582 meanReward: 9.3600 meanLoss: nan
Episode: 3583 meanReward: 9.3600 meanLoss: nan
Episode: 3584 meanReward: 9.3600 meanLoss: nan
Episode: 3585 meanReward: 9.3500 meanLoss: nan
Episode: 3586 meanReward: 9.3700 meanLoss: nan
Episode: 3587 meanReward: 9.3600 meanLoss: nan
Episode: 3588 meanReward: 9.3500 meanLoss: nan
Episode: 3589 meanReward: 9.3500 meanLoss: nan
Episode: 3590 meanReward: 9.3300 meanLoss: nan
Episode: 3591 meanReward: 9.3200 meanLoss: nan
Episode: 3592 meanReward: 9.3200 meanLoss: nan
Episode: 3593 meanReward: 9.3100 meanLoss: nan
Episode: 3594

Episode: 3748 meanReward: 9.5000 meanLoss: nan
Episode: 3749 meanReward: 9.5000 meanLoss: nan
Episode: 3750 meanReward: 9.5100 meanLoss: nan
Episode: 3751 meanReward: 9.5000 meanLoss: nan
Episode: 3752 meanReward: 9.5100 meanLoss: nan
Episode: 3753 meanReward: 9.5300 meanLoss: nan
Episode: 3754 meanReward: 9.5300 meanLoss: nan
Episode: 3755 meanReward: 9.5300 meanLoss: nan
Episode: 3756 meanReward: 9.5200 meanLoss: nan
Episode: 3757 meanReward: 9.5100 meanLoss: nan
Episode: 3758 meanReward: 9.5000 meanLoss: nan
Episode: 3759 meanReward: 9.5000 meanLoss: nan
Episode: 3760 meanReward: 9.5200 meanLoss: nan
Episode: 3761 meanReward: 9.5200 meanLoss: nan
Episode: 3762 meanReward: 9.5200 meanLoss: nan
Episode: 3763 meanReward: 9.5200 meanLoss: nan
Episode: 3764 meanReward: 9.5100 meanLoss: nan
Episode: 3765 meanReward: 9.5000 meanLoss: nan
Episode: 3766 meanReward: 9.4900 meanLoss: nan
Episode: 3767 meanReward: 9.5000 meanLoss: nan
Episode: 3768 meanReward: 9.5100 meanLoss: nan
Episode: 3769

Episode: 3923 meanReward: 9.3400 meanLoss: nan
Episode: 3924 meanReward: 9.3400 meanLoss: nan
Episode: 3925 meanReward: 9.3200 meanLoss: nan
Episode: 3926 meanReward: 9.3200 meanLoss: nan
Episode: 3927 meanReward: 9.3300 meanLoss: nan
Episode: 3928 meanReward: 9.3200 meanLoss: nan
Episode: 3929 meanReward: 9.3300 meanLoss: nan
Episode: 3930 meanReward: 9.3300 meanLoss: nan
Episode: 3931 meanReward: 9.3100 meanLoss: nan
Episode: 3932 meanReward: 9.2900 meanLoss: nan
Episode: 3933 meanReward: 9.2900 meanLoss: nan
Episode: 3934 meanReward: 9.2800 meanLoss: nan
Episode: 3935 meanReward: 9.3000 meanLoss: nan
Episode: 3936 meanReward: 9.3000 meanLoss: nan
Episode: 3937 meanReward: 9.2900 meanLoss: nan
Episode: 3938 meanReward: 9.2900 meanLoss: nan
Episode: 3939 meanReward: 9.2900 meanLoss: nan
Episode: 3940 meanReward: 9.3000 meanLoss: nan
Episode: 3941 meanReward: 9.3100 meanLoss: nan
Episode: 3942 meanReward: 9.3100 meanLoss: nan
Episode: 3943 meanReward: 9.3200 meanLoss: nan
Episode: 3944

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

## Testing

Let's checkout how our trained agent plays the game.

In [184]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model-seq.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
    state = env.reset()
    total_reward = 0
    while True:
        env.render()
        action_logits, initial_state = sess.run([model.actions_logits, model.final_state],
                                                feed_dict = {model.states: state.reshape([1, -1]), 
                                                             model.initial_state: initial_state})
        action = np.argmax(action_logits)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
print('total_reward:{}'.format(total_reward))
env.close()

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward:120.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.